Collation version tracking for macOS

Started by Peter Eisentraut almost 4 years ago, 148 messages

peter.eisentraut@enterprisedb.com

almost 4 years ago

1 attachment(s)

During development, I have been using the attached patch to simulate
libc collation versions on macOS. It just uses the internal major OS
version number. I don't know to what extent the libc locales on
macOS are maintained or updated at all, so I don't know what practical
effect this would have. Again, it's mainly for development. If there
is interest from others, I think we could add this, maybe disabled by
default, or we just keep it in the mailing list archives for interested
parties.

Attachments:

0001-Collation-version-tracking-for-macOS.patch (text/plain; charset=UTF-8)

From 5e6f1a94a839981755f0380960511e60aba2b8d2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 1 Feb 2022 16:07:29 +0100
Subject: [PATCH] Collation version tracking for macOS

---
 src/backend/utils/adt/pg_locale.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 871a710967..f8e57ee236 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -74,6 +74,10 @@
 #include <gnu/libc-version.h>
 #endif
 
+#ifdef __APPLE__
+#include <sys/sysctl.h>
+#endif
+
 #ifdef WIN32
 #include <shlwapi.h>
 #endif
@@ -1696,6 +1700,28 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		else
 			ereport(ERROR,
 					(errmsg("could not load locale \"%s\"", collcollate)));
+#elif defined(__APPLE__)
+		/*
+		 * The POSIX-level locales on macOS are mostly useless for real work,
+		 * and they also don't appear to change much or at all.  However, at
+		 * least for development it is useful to have some version tracking
+		 * mechanism on this platform.  For lack of better ideas, we just
+		 * record the internal major operating system version.
+		 */
+		{
+			char		str[256];
+			size_t		len = sizeof(str);
+			char	   *p;
+
+			if (sysctlbyname("kern.osrelease", str, &len, NULL, 0) != 0)
+				ereport(ERROR,
+						(errmsg("could not get OS release: %m")));
+			/* value is three numbers like "12.3.4", we take only the first one */
+			p = strchr(str, '.');
+			if (p)
+				*p = '\0';
+			collversion = pstrdup(str);
+		}
 #elif defined(WIN32) && _WIN32_WINNT >= 0x0600
 		/*
 		 * If we are targeting Windows Vista and above, we can ask for a name
-- 
2.35.1
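
The string handling in the patch can be tried outside the backend. The following is a minimal sketch, assuming a release string in the dotted form the patch comment describes; `major_version_of` is a hypothetical helper name, not part of the patch, which instead truncates the `sysctlbyname()` buffer in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the leading component of a dotted release string such as "12.3.4"
 * into out.  Returns 0 on success, or -1 if the component is empty or
 * does not fit into out (including the terminating NUL).
 *
 * Hypothetical helper for illustration only.
 */
int
major_version_of(const char *release, char *out, size_t outlen)
{
    const char *dot;
    size_t      n;

    assert(release != NULL && out != NULL);

    /* everything before the first '.', or the whole string if none */
    dot = strchr(release, '.');
    n = dot ? (size_t) (dot - release) : strlen(release);
    if (n == 0 || n >= outlen)
        return -1;

    memcpy(out, release, n);
    out[n] = '\0';
    return 0;
}
```

Unlike the in-place truncation in the patch, this variant leaves the input untouched and reports an empty or oversized component instead of storing it as the version.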

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Eisentraut (#1)

Re: Collation version tracking for macOS

On Mon, Feb 14, 2022 at 10:00 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

During development, I have been using the attached patch to simulate
libc collation versions on macOS. It just uses the internal major OS
version number. I don't know to what extent the libc locales on
macOS are maintained or updated at all, so I don't know what practical
effect this would have. Again, it's mainly for development. If there
is interest from others, I think we could add this, maybe disabled by
default, or we just keep it in the mailing list archives for interested
parties.

Last time I looked into this it seemed like macOS's strcoll() gave
sensible answers in the traditional single-byte encodings, but didn't
understand UTF-8 at all so you get C/strcmp() order. In other words
there was effectively nothing to version. I remember that other old
Unixes used to be like that, and I suspect that they might be using
old pre-UTF-8 FreeBSD code for locales based on a quick peek at [1]
(though FreeBSD itself has since learned to do CLDR-based UTF-8
sorting with a completely new implementation shared with other OSes).
This makes me wonder if Apple is hiding another collation
implementation somewhere up its sleeve -- surely that libc support is
not good enough for the world's shiny globalised macOS/iOS apps?
Maybe UCCompareText() and friends (UnicodeUtilitiesCoreLib) and the
various Obj-C NSString comparison stuff, all of which probably
predates Unixoid macOS (google tells me that UnicodeUtilities.h was
present in Mac OS 9). It wouldn't be surprising if it shares nothing
with the modern OS's C runtime stuff that came via NeXT. Just
mentioning this as a curiosity, because I was trying to figure out how
that could be left non-working without anyone complaining...

[1]: https://github.com/apple-open-source-mirror/Libc/tree/master/locale

Peter Eisentraut

peter.eisentraut@enterprisedb.com

over 3 years ago

In reply to: Thomas Munro (#2)

Re: Collation version tracking for macOS

On 07.05.22 02:31, Thomas Munro wrote:

During development, I have been using the attached patch to simulate
libc collation versions on macOS. It just uses the internal major OS
version number. I don't know to what extent the libc locales on
macOS are maintained or updated at all, so I don't know what practical
effect this would have. Again, it's mainly for development. If there
is interest from others, I think we could add this, maybe disabled by
default, or we just keep it in the mailing list archives for interested
parties.

Last time I looked into this it seemed like macOS's strcoll() gave
sensible answers in the traditional single-byte encodings, but didn't
understand UTF-8 at all so you get C/strcmp() order. In other words
there was effectively nothing to version.

Someone recently told me that collations in macOS have actually changed
recently and that this is a live problem. See explanation here:

https://github.com/PostgresApp/PostgresApp/blob/master/docs/documentation/reindex-warning.md?plain=1#L66

So I think we should reconsider this patch, even for PG15.

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Peter Eisentraut (#3)

Re: Collation version tracking for macOS

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

On 07.05.22 02:31, Thomas Munro wrote:

Last time I looked into this it seemed like macOS's strcoll() gave
sensible answers in the traditional single-byte encodings, but didn't
understand UTF-8 at all so you get C/strcmp() order. In other words
there was effectively nothing to version.

Someone recently told me that collations in macOS have actually changed
recently and that this is a live problem. See explanation here:

https://github.com/PostgresApp/PostgresApp/blob/master/docs/documentation/reindex-warning.md?plain=1#L66

So I think we should reconsider this patch, even for PG15.

According to that document, they changed it in macOS 11, which came out
a year and a half ago. Given the lack of complaints, it doesn't seem
like this is urgent enough to mandate a post-beta change that would
have lots of downside (namely, false-positive warnings for every other
macOS update).

regards, tom lane

Jeremy Schneider

schneider@ardentperf.com

over 3 years ago

In reply to: Tom Lane (#4)

Re: Collation version tracking for macOS

On 6/3/22 9:21 AM, Tom Lane wrote:

According to that document, they changed it in macOS 11, which came out
a year and a half ago. Given the lack of complaints, it doesn't seem
like this is urgent enough to mandate a post-beta change that would
have lots of downside (namely, false-positive warnings for every other
macOS update).

Sorry, I'm going to rant for a minute... it is my very strong opinion
that using language like "false positive" here is misguided and dangerous.

If a new version of a sort order is released, for example when they recently
updated backwards-secondary sorting in French [CLDR-2905] or matching of
v and w in Swedish and Finnish [CLDR-7088], it is very dangerous to use
language like “false positive” to describe a database where there just
didn't happen to be any rows with accented French characters at the
point in time where PostgreSQL magically changed which version of sort
order it was using from the 2010 French version to the 2020 French version.

No other piece of software that calls itself a database would do what
PostgreSQL is doing: just give users a "warning" after suddenly changing
the sort order algorithm (most users won't even read warnings in their
logs). Oracle, DB2, SQL Server and even MySQL carefully version
collation data, hardcode a pseudo-linguistic collation into the DB (like
PG does for timezones), and if they provide updates to linguistic sort
order (from Unicode CLDR) then they allow the user to explicitly specify
which version of French or German ICU sorting they want to use.
Different versions are treated as different sort orders; they are not
conflated.

I have personally seen PostgreSQL databases where an update to an old
version of glibc was applied (I'm not even talking 2.28 here) and it
resulted in data loss b/c crash recovery couldn't replay WAL records and
the user had to do a PITR. That's aside from the more common issues of
segfaults or duplicate records that violate unique constraints or wrong
query results like missing data. And it's not just updates - people can
set up a hot standby on a different version and see many of these
problems too.

Collation versioning absolutely must be first class and directly
controlled by users, and it's very dangerous to allow users - at all -
to take an index and then use a different version than what the index
was built with.

Not to mention all the other places in the DB where collation is used...
partitioning, constraints, and any other place where persisted data can
make an assumption about any sort of string comparison.

It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.

-Jeremy Schneider

--
http://about.me/jeremy_schneider

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Eisentraut (#3)

Re: Collation version tracking for macOS

On Sat, Jun 4, 2022 at 12:17 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

On 07.05.22 02:31, Thomas Munro wrote:

Last time I looked into this it seemed like macOS's strcoll() gave
sensible answers in the traditional single-byte encodings, but didn't
understand UTF-8 at all so you get C/strcmp() order. In other words
there was effectively nothing to version.

Someone recently told me that collations in macOS have actually changed
recently and that this is a live problem. See explanation here:

https://github.com/PostgresApp/PostgresApp/blob/master/docs/documentation/reindex-warning.md?plain=1#L66

How can I see evidence of this? I'm comparing Debian, FreeBSD and
macOS 12.4 and when I run "LC_COLLATE=en_US.UTF-8 sort
/usr/share/dict/words" I get upper and lower case mixed together on
the other OSes, but on the Mac the upper case comes first, which is my
usual smoke test for "am I looking at binary sort order?"
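
The same smoke test can be approximated from C without `sort` and a word list. This is a minimal sketch, assuming that a single pair ("B" versus "a") is enough of a hint; `uppercase_sorts_first` is a hypothetical name, and a result of 1 only suggests binary order rather than proving it.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Returns 1 if the named locale sorts "B" before "a" (as strcmp() does
 * for ASCII), 0 if it sorts "a" first, and -1 if setlocale() rejects
 * the locale name.
 *
 * Hypothetical helper; it changes the process-wide LC_COLLATE setting.
 */
int
uppercase_sorts_first(const char *locale_name)
{
    assert(locale_name != NULL);

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;

    return strcoll("B", "a") < 0 ? 1 : 0;
}
```

Calling it with "en_US.UTF-8" on each platform would be the equivalent of the `sort` comparison above; the result depends on the locale data installed on the machine.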

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Jeremy Schneider (#5)

Re: Collation version tracking for macOS

On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

No other piece of software that calls itself a database would do what
PostgreSQL is doing: just give users a "warning" after suddenly changing
the sort order algorithm (most users won't even read warnings in their
logs). Oracle, DB2, SQL Server and even MySQL carefully version
collation data, hardcode a pseudo-linguistic collation into the DB (like
PG does for timezones), and if they provide updates to linguistic sort
order (from Unicode CLDR) then they allow the user to explicitly specify
which version of French or German ICU sorting they want to use.
Different versions are treated as different sort orders; they are not
conflated.

I guess you know this but for the record, there have been discussions
before about supporting multiple versions of ICU concurrently, like
DB2. For example, one idea was that each ICU version could be a
separate "provider" in PostgreSQL, so you can concurrently use
multiple versions.

One of several places this came up:
/messages/by-id/CADE5jYJTnYaTNXMFKOK-0p44+Dm5LMcRcJ5kVi1MVHomb2QTkQ@mail.gmail.com

It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.

FWIW A couple of us tried quite hard to make smarter warnings, and
that thread and others discussed a lot of those topics, like the
relevance to constraints and so forth.

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Thomas Munro (#7)

Re: Collation version tracking for macOS

Thomas Munro <thomas.munro@gmail.com> writes:

On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.

FWIW A couple of us tried quite hard to make smarter warnings, and
that thread and others discussed a lot of those topics, like the
relevance to constraints and so forth.

I think the real problem here is that the underlying software mostly
doesn't take this issue seriously. Unfortunately, that leads one to
the conclusion that we need to maintain our own collation code and
data (e.g., our own fork of ICU), and that isn't happening. Unlike
say Oracle, we do not have the manpower; nor do we want to bloat our
code base that much.

Short of maintaining our own fork, ranting about the imperfections
of the situation is a waste of time.

regards, tom lane

Jim Nasby

nasbyj@amazon.com

over 3 years ago

In reply to: Tom Lane (#8)

Re: Collation version tracking for macOS

On 6/3/22 3:58 PM, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.

FWIW A couple of us tried quite hard to make smarter warnings, and
that thread and others discussed a lot of those topics, like the
relevance to constraints and so forth.

I think the real problem here is that the underlying software mostly
doesn't take this issue seriously. Unfortunately, that leads one to
the conclusion that we need to maintain our own collation code and
data (e.g., our own fork of ICU), and that isn't happening. Unlike
say Oracle, we do not have the manpower; nor do we want to bloat our
code base that much.

Short of maintaining our own fork, ranting about the imperfections
of the situation is a waste of time.

The first step to a solution is admitting that the problem exists.
Ignoring broken backups, segfaults and data corruption as a "rant"
implies that we simply throw in the towel and tell users to suck it up
or switch engines. There are other ways to address this short of the
community doing all the work itself. One simple example would be to
refuse to start if the collation provider has changed since initdb
(which we'd need to allow users to override). A more sophisticated
option would be to provide the machinery for supporting multiple
collation libraries. Both of those at least ensure that users are aware
any time there's a problem, which IMO is *enormously* better than
letting core functionality silently stop working.

#10

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Jim Nasby (#9)

Re: Collation version tracking for macOS

Jim Nasby <nasbyj@amazon.com> writes:

I think the real problem here is that the underlying software mostly
doesn't take this issue seriously.

The first step to a solution is admitting that the problem exists.
Ignoring broken backups, segfaults and data corruption as a "rant"
implies that we simply throw in the towel and tell users to suck it up
or switch engines. There are other ways to address this short of the
community doing all the work itself. One simple example would be to
refuse to start if the collation provider has changed since initdb
(which we'd need to allow users to override).

You're conveniently skipping over the hard part, which is to tell
whether the collation provider has changed behavior (which we'd better
do with pretty darn high accuracy, if we're going to refuse to start
on the basis of thinking it has). Unfortunately, giving a reliable
indication of collation behavioral changes is *exactly* the thing
that the providers aren't taking seriously.

regards, tom lane

#11

Jeremy Schneider

schneider@ardentperf.com

over 3 years ago

In reply to: Jim Nasby (#9)

Re: Collation version tracking for macOS

On Jun 6, 2022, at 17:10, Jim Nasby <nasbyj@amazon.com> wrote:

Ignoring broken backups, segfaults and data corruption as a "rant" implies that we simply throw in the towel and tell users to suck it up or switch engines.

Well now, let’s be clear, I was the one who called my email a “rant”. 🙂

And I do apologize for that - it was grumpy and impulsive and Tom isn’t wrong that rants don’t usually help move things forward.

Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of material; I haven’t read the whole thread yet. If you have some others that would also be particularly good background, let me know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I was pulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple years, so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in the conversation here.

Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just warn on version mismatch. Correctness matters to many users, and being able to *trust* that string comparisons are correct is pretty damn fundamental all throughout a database. It really doesn’t get any more basic and the potential for bad things to happen is pretty astronomical, if you can’t trust those. I understand the consternation about dealing with upgrades of large & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I have a lot of reading to do, to really understand how that happened and where the dialogue is today.

Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer), but I still feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good enough” for 99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change. I think glibc needs to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If MySQL wasn’t GPL then I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now though, it feels like my idea is the outlier and the general PG hacker consensus would be to reject this idea. (But maybe I’m wrong?)

Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for the non-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me <g>).

-Jeremy

Sent from my TI-83

#12

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Jim Nasby (#9)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 12:10 PM Jim Nasby <nasbyj@amazon.com> wrote:

On 6/3/22 3:58 PM, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

It feels to me like we're still not really thinking clearly about this
within the PG community, and that the seriousness of this issue is not
fully understood.

FWIW A couple of us tried quite hard to make smarter warnings, and
that thread and others discussed a lot of those topics, like the
relevance to constraints and so forth.

I think the real problem here is that the underlying software mostly
doesn't take this issue seriously. Unfortunately, that leads one to
the conclusion that we need to maintain our own collation code and
data (e.g., our own fork of ICU), and that isn't happening. Unlike
say Oracle, we do not have the manpower; nor do we want to bloat our
code base that much.

Short of maintaining our own fork, ranting about the imperfections
of the situation is a waste of time.

The first step to a solution is admitting that the problem exists.

We've been discussing this topic for years and I don't think anyone
thinks the case is closed...

Ignoring broken backups, segfaults and data corruption as a "rant"
implies that we simply throw in the towel and tell users to suck it up
or switch engines. There are other ways to address this short of the
community doing all the work itself. One simple example would be to
refuse to start if the collation provider has changed since initdb
(which we'd need to allow users to override).

Yeah, it's been discussed, but never proposed. The problem is that
you need to start up to fix the problem. Another option is not to use
affected indexes, but that doesn't help with other forms of the
problem (partition constraints, etc).

A more sophisticated
option would be to provide the machinery for supporting multiple
collation libraries.

Earlier I mentioned distinct "providers" but I take that back, that's
too complicated. Reprising an old idea that comes up each time we
talk about this, this time with some more straw-man detail: what about
teaching our ICU support to understand "libicu18n.so.71:en" to mean
that it should dlopen() that library and use its functions? Or some
cleverer, shorter notation. Then it's the user's problem to make sure
the right libraries are installed, and it'll fail if they're not. For
example, on Debian bookworm right now you can install libicu63,
libicu67, libicu71, though only the "current" -dev package, but which
I'm sure we can cope with. You're at the mercy of the distro or
add-on package repos to keep a lot of versions around, but that seems
OK. Maintaining our own fork(s) of ICU would seem like massive
overkill and I don't think anyone has suggested that; the question on
my mind is whether we could rely on existing packages. Then you'd be
exposed only to changes that happen within (say) the ICU 63 package's
lifetime... I recall looking into whether that can happen but ... I
don't recall the answer.

#13

Rod Taylor

rbt@rbt.ca

over 3 years ago

In reply to: Tom Lane (#10)

Re: Collation version tracking for macOS

On Mon, Jun 6, 2022 at 8:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jim Nasby <nasbyj@amazon.com> writes:

I think the real problem here is that the underlying software mostly
doesn't take this issue seriously.

The first step to a solution is admitting that the problem exists.
Ignoring broken backups, segfaults and data corruption as a "rant"
implies that we simply throw in the towel and tell users to suck it up
or switch engines. There are other ways to address this short of the
community doing all the work itself. One simple example would be to
refuse to start if the collation provider has changed since initdb
(which we'd need to allow users to override).

You're conveniently skipping over the hard part, which is to tell
whether the collation provider has changed behavior (which we'd better
do with pretty darn high accuracy, if we're going to refuse to start
on the basis of thinking it has). Unfortunately, giving a reliable
indication of collation behavioral changes is *exactly* the thing
that the providers aren't taking seriously.

Is this more involved than creating a list of all valid Unicode characters
(~144 thousand), sorting them, then running crc32 over the sorted order to
create the "version" for the library/collation pair? Far from free but few
databases use more than a couple different collations.

--
Rod Taylor

#14

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Rod Taylor (#13)

Re: Collation version tracking for macOS

On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote:

Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then running crc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few databases use more than a couple different collations.

Collation rules have multiple levels and all kinds of quirks, so that
won't work.
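
For concreteness, the fingerprinting proposal could look roughly like the sketch below. It is an illustration under stated assumptions, not anyone's proposed implementation: the caller supplies the sample strings (rather than every Unicode character), `collation_fingerprint` is a hypothetical name, and 64-bit FNV-1a is swapped in for crc32 only to avoid a library dependency. The objection above still applies: a hash over sorted samples only notices changes that happen to reorder those samples.

```c
#include <assert.h>
#include <locale.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator that orders C strings using the current LC_COLLATE. */
static int
cmp_strcoll(const void *a, const void *b)
{
    return strcoll(*(const char *const *) a, *(const char *const *) b);
}

/*
 * Sort a copy of the sample strings with strcoll() under the named
 * locale and hash the sorted sequence with FNV-1a.  Returns 0 if the
 * locale cannot be selected or memory cannot be allocated.
 *
 * Hypothetical helper; it changes the process-wide LC_COLLATE setting.
 */
uint64_t
collation_fingerprint(const char *locale_name,
                      const char *const *samples, size_t n)
{
    const char **sorted;
    uint64_t    h = UINT64_C(14695981039346656037);   /* FNV offset basis */
    size_t      i;

    assert(locale_name != NULL && samples != NULL && n > 0);

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return 0;
    sorted = malloc(n * sizeof(*sorted));
    if (sorted == NULL)
        return 0;
    memcpy(sorted, samples, n * sizeof(*sorted));
    qsort(sorted, n, sizeof(*sorted), cmp_strcoll);

    for (i = 0; i < n; i++)
    {
        const unsigned char *p = (const unsigned char *) sorted[i];

        /* hash each byte, including the NUL, which acts as a separator */
        do
        {
            h ^= *p;
            h *= UINT64_C(1099511628211);              /* FNV prime */
        } while (*p++ != '\0');
    }

    free(sorted);
    return h;
}
```

Because the samples are sorted before hashing, the result does not depend on the order in which distinct strings are supplied; strings that the locale treats as equal could still end up in either order, which is one more reason to treat this as a rough signal only.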

#15

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Tom Lane (#8)

Re: Collation version tracking for macOS

On Fri, Jun 3, 2022 at 4:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think the real problem here is that the underlying software mostly
doesn't take this issue seriously. Unfortunately, that leads one to
the conclusion that we need to maintain our own collation code and
data (e.g., our own fork of ICU), and that isn't happening. Unlike
say Oracle, we do not have the manpower; nor do we want to bloat our
code base that much.

You don't, but that opinion isn't universally held, or at least not
with the same vigor that you hold it. See e.g.
/messages/by-id/a4019c5e570d4dbb5e3f816c080fb57c76ab604a.camel@cybertec.at
and subsequent discussion, for example.

In fact, I'd go so far as to argue that you're basically sticking your
head in the sand here. You wrote:

"Given the lack of complaints, it doesn't seem
like this is urgent enough to mandate a post-beta change that would
have lots of downside (namely, false-positive warnings for every other
macOS update)."

But you wrote that to Peter, who was essentially complaining that we
hadn't done anything, and linked to another source, which was also
complaining about the problem, and then Jeremy Schneider replied to
your email and complained some more.

Complaining about "false positives" doesn't really make sense to me.
It's true that we don't have any false positives right now, but we
also have no true positives. Even a stopped clock is right twice a
day, but not in a useful way. People want to be notified when a
problem might exist, even if sometimes it doesn't actually. The
alternative is having no idea at all that things might be broken,
which is not better.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Thomas Munro (#14)

Re: Collation version tracking for macOS

Thomas Munro <thomas.munro@gmail.com> writes:

On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote:

Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then running crc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few databases use more than a couple different collations.

Collation rules have multiple levels and all kinds of quirks, so that
won't work.

Yeah, and it's exactly at the level of quirks that things are likely
to change. Nobody's going to suddenly start sorting B before A.
They might, say, change their minds about where the digram "cz"
sorts relative to single letters, in languages where special rules
for that are a thing.

The idea of fingerprinting a collation's behavior is interesting,
but I've got doubts about whether we can make a sufficiently thorough
fingerprint.

regards, tom lane

#17

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Robert Haas (#15)

Re: Collation version tracking for macOS

Robert Haas <robertmhaas@gmail.com> writes:

In fact, I'd go so far as to argue that you're basically sticking your
head in the sand here. You wrote:

No, I quite agree that we have a problem. What I don't agree with is that
issuing a lot of false-positive warnings is a solution. That will
just condition people to ignore the warnings, and then when their
platform really does change behavior, they're still screwed. If we
could *accurately* report collation behavioral changes, I'd be all
for that.

Rod's idea upthread is certainly way too simplistic, but could we
build a set of test cases that do detect known changes in collation
behaviors? We'd be shooting at a moving target; but even if we're
late in noticing that platform X changed the behavior of collation Y,
we could help users who run into the problem afterwards.

regards, tom lane
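
A set of test cases for known behavior changes could be represented as a table of probes. The following is a minimal sketch, assuming each probe records the sign that `strcoll()` returned when the probe was written; `CollationProbe` and `count_changed_probes` are hypothetical names, and the sketch says nothing about which probes would actually catch real platform changes.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* One probe: two strings and the sign strcoll(a, b) had when recorded. */
typedef struct CollationProbe
{
    const char *a;
    const char *b;
    int         expected_sign;      /* -1, 0 or +1 */
} CollationProbe;

/* Reduce a comparison result to -1, 0 or +1. */
static int
sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/*
 * Returns the number of probes whose current strcoll() result differs
 * from the recorded sign, or -1 if the locale cannot be selected.
 *
 * Hypothetical helper; it changes the process-wide LC_COLLATE setting.
 */
int
count_changed_probes(const char *locale_name,
                     const CollationProbe *probes, size_t n)
{
    int         changed = 0;
    size_t      i;

    assert(locale_name != NULL && probes != NULL);

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;

    for (i = 0; i < n; i++)
    {
        int         now = sign_of(strcoll(probes[i].a, probes[i].b));

        if (now != probes[i].expected_sign)
            changed++;
    }
    return changed;
}
```

A nonzero count would mean at least one recorded comparison no longer holds; a zero count only means that none of the listed probes changed.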

#18

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Robert Haas (#15)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 12:37 PM Robert Haas <robertmhaas@gmail.com> wrote:

It's true that we don't have any false positives right now, but we
also have no true positives. Even a stopped clock is right twice a
day, but not in a useful way. People want to be notified when a
problem might exist, even if sometimes it doesn't actually.

Collations by their very nature are unlikely to change all that much.
Obviously they can and do change, but the details are presumably
pretty insignificant to a native speaker. Stands to reason that the
issue (which is fundamentally a problem for natural language experts)
would have been resolved far sooner if there really was a significant
controversy about something that tends to come up often.

It's pretty clear that glibc as a project doesn't take the issue very
seriously, because they see it as a problem of the GUI sorting a table
in a way that seems slightly suboptimal to scholars of a natural
language. Clearly that isn't actually a big deal. But the latent
possibility of wrong answers to queries is a very big deal. Both are
true. It's just a matter of priorities in each case.

I agree that "false positive" is not a valid way of describing a
breaking change in a Postgres collation that happens to not affect one
index in particular, due to the current phase of the moon. It's
probably very likely that most individual indexes that we warn about
will be so-called false positives. I bet that there are many
near-misses that we never get to hear about already. That's rather
beside the point. The index must be assumed to be corrupt.

--
Peter Geoghegan

#19

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Peter Geoghegan (#18)

Re: Collation version tracking for macOS

Peter Geoghegan <pg@bowt.ie> writes:

I agree that "false positive" is not a valid way of describing a
breaking change in a Postgres collation that happens to not affect one
index in particular, due to the current phase of the moon. It's
probably very likely that most individual indexes that we warn about
will be so-called false positives.

This is not the concern that I have. I agree that if we tell a user
that collation X changed behavior and he'd better reindex his indexes
that use collation X, but none of them actually contain any cases that
changed behavior, that's not a "false positive" --- that's "it's cheaper
to reindex than to try to identify whether there's a problem". What
I mean by "false positive" is telling every macOS user that they'd better
reindex everything every year, when in point of fact Apple changes those
collations almost never. We will soon lose those users' attention ---
see fable about boy crying wolf --- and then when Apple actually does
change something, we've got a problem. So if we give collation-change
warnings, they'd better have some measurable connection to reality.

regards, tom lane

#20

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#12)

Re: Collation version tracking for macOS

On Mon, Jun 6, 2022 at 5:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Earlier I mentioned distinct "providers" but I take that back, that's
too complicated. Reprising an old idea that comes up each time we
talk about this, this time with some more straw-man detail: what about
teaching our ICU support to understand "libicu18n.so.71:en" to mean
that it should dlopen() that library and use its functions? Or some
cleverer, shorter notation. Then it's the user's problem to make sure
the right libraries are installed, and it'll fail if they're not. For
example, on Debian bookworm right now you can install libicu63,
libicu67, libicu71, though only the "current" -dev package, but which
I'm sure we can cope with. You're at the mercy of the distro or
add-on package repos to keep a lot of versions around, but that seems
OK.

Right. Postgres could link to multiple versions of ICU at the same
time. Right now it doesn't, and right now the ICU C symbol names that
we use are actually versioned (this isn't immediately apparent because
the C preprocessor makes it appear that ICU symbol names are generic).

We could perhaps invent a new indirection that knows about
multiple ICU versions, each of which is an independent collation
provider, or maybe a related collation provider that gets used by
default on REINDEX. ICU is designed for this kind of thing. That
approach more or less puts packagers on the hook for managing
collation stability. But now long term collation stability is at least
feasible -- we at least have a coherent strategy. In the worst case
the community .deb and .rpm repos might continue to support an older
ICU version, or lobby for its continued support by the distro (while
actively discouraging its use in new databases). This isn't the same
thing as forking ICU. It's a compromise between that extreme, and
the current situation.

--
Peter Geoghegan

#21

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Tom Lane (#16)

Re: Collation version tracking for macOS

On Wed, Jun 8, 2022 at 7:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The idea of fingerprinting a collation's behavior is interesting,
but I've got doubts about whether we can make a sufficiently thorough
fingerprint.

On one of the many threads about this I recall posting a thought
experiment patch that added system_collation_version_command or some
such, so you could train your computer to compute a hash for
/usr/share/locale/XXX/LC_COLLATE (or whatever it's called on your
system), but it all seemed a bit gross to me on various levels. Most
people don't know or care about collations so they won't set it up, so
to make it useful it'd have to have useful defaults, and it seems like
a bad idea to teach PostgreSQL where all these systems keep their
collation rules.
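
To make the thought experiment concrete, here is a minimal sketch of what such a command might compute. Everything in it is an assumption for illustration: the function names are hypothetical, FNV-1a is just a convenient stand-in hash, and the caller still has to know where the platform keeps its collation file, which is exactly the part that seemed gross.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define FNV64_OFFSET UINT64_C(0xcbf29ce484222325)
#define FNV64_PRIME  UINT64_C(0x100000001b3)

/* Fold len bytes of data into a running 64-bit FNV-1a hash. */
uint64_t
fnv1a_update(uint64_t h, const unsigned char *data, size_t len)
{
	assert(data != NULL || len == 0);
	for (size_t i = 0; i < len; i++)
	{
		h ^= data[i];
		h *= FNV64_PRIME;
	}
	return h;
}

/*
 * Hypothetical version function: hash the collation definition file at
 * "path" and write the result to buf as 16 hex digits.  Returns 0 on
 * success, -1 if the file cannot be read or buf is too small.
 */
int
collation_file_version(const char *path, char *buf, size_t buflen)
{
	unsigned char chunk[4096];
	uint64_t	h = FNV64_OFFSET;
	size_t		n;
	int			failed;
	FILE	   *fp;

	if (buflen < 17)
		return -1;
	fp = fopen(path, "rb");
	if (fp == NULL)
		return -1;
	while ((n = fread(chunk, 1, sizeof(chunk), fp)) > 0)
		h = fnv1a_update(h, chunk, n);
	failed = ferror(fp);
	fclose(fp);
	if (failed)
		return -1;
	snprintf(buf, buflen, "%016" PRIx64, h);
	return 0;
}
```

A hash like this changes whenever the file's bytes change, including rebuilds that do not alter sort order, so it would over-report rather than under-report.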

#22

Jeremy Schneider

schneider@ardentperf.com

over 3 years ago

In reply to: Peter Geoghegan (#18)

Re: Collation version tracking for macOS

On 6/7/22 12:53 PM, Peter Geoghegan wrote:

Collations by their very nature are unlikely to change all that much.
Obviously they can and do change, but the details are presumably
pretty insignificant to a native speaker.

This idea does seem to persist. It's not as frequent as timezones, but
collation rules reflect local dialects and customs, and there are
changes quite regularly for a variety of reasons. A brief perusal of
CLDR changelogs and CLDR jiras can give some insight here:

https://github.com/unicode-org/cldr

https://unicode-org.atlassian.net/jira/software/c/projects/CLDR/issues/?jql=project%20%3D%20%22CLDR%22%20AND%20text%20~%20%22collation%22%20ORDER%20BY%20created%20DESC

The difference between the unicode consortium and the GNU C Library is
that unicode is maintained by people who are specifically interested in
working with language and internationalization challenges. I've spoken
to a glibc maintainer who directly told me that they dislike working
with the collation code, and try to avoid it. It's not even ISO 14651
anymore with so many custom glibc-specific changes layered on top. I
looked at the first few commits in the glibc source that were
responsible for the big 2.28 changes - there were a series of quite a
few commits and some were so large they wouldn't even load in the github
API.

Here's one such commit:

https://github.com/bminor/glibc/commit/9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4

It's reasonable to expect that Red Hat and Debian will keep things
stable on one particular major, and to expect that every new major OS
version will update to the latest collation algorithms and locale data
for glibc.

Another misunderstanding that seems to persist is that this only relates
to exotic locales or that it's only the 2.28 version.

My github repo is out-of-date (I know of more cases that I still need to
publish) but the old data already demonstrates changes to the root/DUCET
collation rules (evident in en_US without any tailoring) for glibc
versions 2.13, 2.21 and 2.26

https://github.com/ardentperf/glibc-unicode-sorting/

If a PostgreSQL user is unlucky enough to have one of those unicode
characters stored in a table, they can get broken indexes even if they
only use the default US english locale, and without touching glibc 2.28
- and all you need is an index on a field where end users can type any
string input.

It's pretty clear that glibc as a project doesn't take the issue very
seriously, because they see it as a problem of the GUI sorting a table
in a way that seems slightly suboptimal to scholars of a natural
language.

I disagree that glibc maintainers are doing anything wrong.

While the quality of glibc collations isn't great when compared with
CLDR, I think the glibc maintainers have done versioning exactly right:
they are clear about which patches are allowed to contain collation
updates, and the OS distributions are able to ensure stability on major
OS release. I haven't yet found a Red Hat minor release that changed
glibc collation.

-Jeremy

--
http://about.me/jeremy_schneider

#23

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Tom Lane (#19)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This is not the concern that I have. I agree that if we tell a user
that collation X changed behavior and he'd better reindex his indexes
that use collation X, but none of them actually contain any cases that
changed behavior, that's not a "false positive" --- that's "it's cheaper
to reindex than to try to identify whether there's a problem". What
I mean by "false positive" is telling every macOS user that they'd better
reindex everything every year, when in point of fact Apple changes those
collations almost never.

That does seem like a meaningful distinction. I'm sorry if I
misrepresented your position on this.

We're talking about macOS here, which is hardly a paragon of lean
software. I think that it's worth revisiting the assumption that the C
standard library collations are the most useful set of collations, and
we shouldn't presume to know better than the operating system.
Couldn't individual packagers establish their own system for managing
collations across multiple ICU versions, as I outlined up-thread?

I think that it's okay (maybe unavoidable) that we keep "lib C
collations are authoritative" as a generic assumption when Postgres is
built from source. We can still have de facto standards that apply on
all mainstream platforms when users install standard packages for
production databases -- I don't see why we can't do both. Maybe the
best place to solve this problem is at the level of each individual
package ecosystem.

There can be some outsourcing to package managers this way, without
relying on the underlying OS, or lib C collations, or ICU in general.
This scheme wouldn't technically be under our direct control, but
would still be something that we could influence. We could have a back
and forth conversation about what's not working in the field.

--
Peter Geoghegan

#24

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Jeremy Schneider (#22)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 1:24 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:

This idea does seem to persist. It's not as frequent as timezones, but
collation rules reflect local dialects and customs, and there are
changes quite regularly for a variety of reasons. A brief perusal of
CLDR changelogs and CLDR jiras can give some insight here:

Another misunderstanding that seems to persist is that this only relates
to exotic locales or that it's only the 2.28 version.

I'm not defending the status quo, and I think that I'm better informed
than most about the problems in this area. My point was that it hardly
matters that we don't necessarily see outright corruption. This was
based in part on a misunderstanding of Tom's point, though.

While the quality of glibc collations isn't great when compared with
CLDR, I think the glibc maintainers have done versioning exactly right:
they are clear about which patches are allowed to contain collation
updates, and the OS distributions are able to ensure stability on major
OS release. I haven't yet found a Red Hat minor release that changed
glibc collation.

That might be true, but my impression from interacting with Carlos
O'Donnell is that they pretty much don't take the concern about
stability all that seriously. Which I think is reasonable, given his
position!

The fact that we are this sensitive to glibc collation versioning might
be a wholly unique situation (unlike with ICU, which was built with
that in mind). It might be that every other user of glibc collations
sees this as fairly inconsequential, because they don't have to deal
with persistent state that directly relies on the rules in various
ways that are critically important. Even if glibc theoretically does a
perfect job of versioning, I still think that their priorities are
very much unlike our priorities, and that that should be a relevant
consideration for us.

--
Peter Geoghegan

#25

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Jeremy Schneider (#22)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 4:24 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:

I haven't yet found a Red Hat minor release that changed
glibc collation.

I feel like this is a thing that happens regularly enough that it's
known to be a gotcha by many of my colleagues here at EDB.

Perhaps that's all pure fiction, but I doubt it. People don't go
around making up stories about things being broken so they can say bad
things about Red Hat. They got told by customers that things are
broken and then go try to figure out how that happened.

--
Robert Haas
EDB: http://www.enterprisedb.com

#26

Jeremy Schneider

schneider@ardentperf.com

over 3 years ago

In reply to: Peter Geoghegan (#24)

Re: Collation version tracking for macOS

On 6/7/22 1:51 PM, Peter Geoghegan wrote:

On Tue, Jun 7, 2022 at 1:24 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:

This idea does seem to persist. It's not as frequent as timezones, but
collation rules reflect local dialects and customs, and there are
changes quite regularly for a variety of reasons. A brief perusal of
CLDR changelogs and CLDR jiras can give some insight here:

Another misunderstanding that seems to persist is that this only relates
to exotic locales or that it's only the 2.28 version.

I'm not defending the status quo, and I think that I'm better informed
than most about the problems in this area. My point was that it hardly
matters that we don't necessarily see outright corruption. This was
based in part on a misunderstanding of Tom's point, though.

I think I was guilty of the same misunderstanding - apologies Tom!
Thanks Peter for calling that out explicitly.

For my part, gut feeling is that MacOS major releases will be
similar to any other OS major release, which may contain updates to
collation algorithms and locales. ISTM like the same thing PG is looking
for on other OS's to trigger the warning. But it might be good to get an
official reference on MacOS, if someone knows where to find one? (I don't.)

-Jeremy

--
http://about.me/jeremy_schneider

#27

Bruce Momjian

bruce@momjian.us

over 3 years ago

In reply to: Tom Lane (#16)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 03:43:32PM -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote:

Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then running crc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few databases use more than a couple different collations.

Collation rules have multiple levels and all kinds of quirks, so that
won't work.

Yeah, and it's exactly at the level of quirks that things are likely
to change. Nobody's going to suddenly start sorting B before A.
They might, say, change their minds about where the digram "cz"
sorts relative to single letters, in languages where special rules
for that are a thing.

The idea of fingerprinting a collation's behavior is interesting,
but I've got doubts about whether we can make a sufficiently thorough
fingerprint.
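
For reference, the sort-and-hash proposal quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions: collation_fingerprint is a hypothetical name, FNV-1a stands in for crc32 to avoid a zlib dependency, and the caller picks the code point range. It also shows the weakness being discussed: only single code points are compared, so a change in how a digram such as "cz" sorts would leave the hash untouched.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Compare two code points as one-character strings under LC_COLLATE. */
static int
cmp_codepoint(const void *a, const void *b)
{
	wchar_t		sa[2] = {*(const wchar_t *) a, L'\0'};
	wchar_t		sb[2] = {*(const wchar_t *) b, L'\0'};
	int			r = wcscoll(sa, sb);

	/* Break ties by code point so the sorted order is deterministic. */
	if (r == 0)
		r = (sa[0] > sb[0]) - (sa[0] < sb[0]);
	return r;
}

/*
 * Hypothetical helper: sort the code points in [first, last] using the
 * current LC_COLLATE setting and hash the resulting order (32-bit FNV-1a).
 */
uint32_t
collation_fingerprint(wchar_t first, wchar_t last)
{
	size_t		n;
	wchar_t    *cps;
	uint32_t	h = 2166136261u;	/* FNV offset basis */

	assert(first <= last);
	n = (size_t) (last - first) + 1;
	cps = malloc(n * sizeof(wchar_t));
	assert(cps != NULL);

	for (size_t i = 0; i < n; i++)
		cps[i] = (wchar_t) (first + (wchar_t) i);
	qsort(cps, n, sizeof(wchar_t), cmp_codepoint);

	/* Feed each sorted code point into the hash, low byte first. */
	for (size_t i = 0; i < n; i++)
	{
		uint32_t	cp = (uint32_t) cps[i];

		for (int shift = 0; shift < 32; shift += 8)
		{
			h ^= (cp >> shift) & 0xff;
			h *= 16777619u;			/* FNV prime */
		}
	}
	free(cps);
	return h;
}
```

A caller would first select the collation with setlocale(LC_COLLATE, ...) from <locale.h>; without that call the program runs in the "C" locale and the fingerprint only reflects code point order.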

Rather than trying to figure out if the collations changed, have we ever
considered checking if index additions and lookups don't match the OS
collation and reporting these errors somehow?
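
As a rough illustration of what such a check amounts to, the sketch below walks a list of keys in stored order and reports the first adjacent pair that the current collation would order differently. It is a standalone toy under assumed names (first_collation_violation is hypothetical) and says nothing about where in the index code such a check would live.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the position of the first key that sorts before its predecessor
 * under the current LC_COLLATE, or -1 if the whole array is in order.
 */
long
first_collation_violation(const char *const *keys, size_t nkeys)
{
	assert(keys != NULL || nkeys == 0);
	for (size_t i = 1; i < nkeys; i++)
	{
		/* Stored order claims keys[i-1] <= keys[i]; ask the OS again. */
		if (strcoll(keys[i - 1], keys[i]) > 0)
			return (long) i;
	}
	return -1;
}
```

A check of this kind can only find keys that are already misplaced; an index whose current contents happen to sort identically under the old and new rules would pass.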

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#28

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Jeremy Schneider (#26)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 2:13 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:

For my part, gut feeling is that MacOS major releases will be
similar to any other OS major release, which may contain updates to
collation algorithms and locales. ISTM like the same thing PG is looking
for on other OS's to trigger the warning. But it might be good to get an
official reference on MacOS, if someone knows where to find one? (I don't.)

I just don't think that we should be relying on a huge entity like
Apple or even glibc for this -- they don't share our priorities, and
there is no reason for this to change. The advantage of ICU versioning
is that it is just one library, that can coexist with others,
including other versions of ICU.

Imagine a world in which we support multiple ICU versions (for Debian
packages, say), some of which are getting quite old. Maybe we can
lobby for the platform to continue to support that old version of the
library -- there ought to be options. Lobbying Debian to stick with an
older version of glibc is another matter entirely. That has precisely
zero chance of ever succeeding, for reasons that are quite
understandable.

Half the problem here is to detect breaking changes, but the other
half is to not break anything in the first place. Or to give the user
plenty of opportunity to transition incrementally, without needing to
reindex everything at the same time. Obviously the only way that's
possible is by supporting multiple versions of ICU at the same time,
in the same database. This requires indirection that distinguishes
between "physical and logical" collation versions, where the same
nominal collation can have different implementations across multiple
ICU versions.
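
One way to picture the "physical and logical" split is a small lookup keyed by the nominal collation name plus the version recorded when an index was built. The sketch below is purely illustrative: the struct, its fields, and the function name are hypothetical, not a proposed catalog layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: one concrete implementation of a nominal collation. */
typedef struct PhysicalCollation
{
	const char *logical_name;	/* nominal collation name */
	int			icu_major;		/* ICU library that implements it */
	const char *collversion;	/* version reported by that library */
} PhysicalCollation;

/*
 * Find the implementation an existing index was built with by matching
 * the version recorded at build time.  Returns NULL if no installed ICU
 * provides it, in which case the index would have to be rebuilt.
 */
const PhysicalCollation *
lookup_physical_collation(const PhysicalCollation *avail, size_t navail,
						  const char *logical_name,
						  const char *recorded_version)
{
	assert(avail != NULL || navail == 0);
	for (size_t i = 0; i < navail; i++)
	{
		if (strcmp(avail[i].logical_name, logical_name) == 0 &&
			strcmp(avail[i].collversion, recorded_version) == 0)
			return &avail[i];
	}
	return NULL;
}
```

With a table like this, old indexes keep resolving to the library they were built with while new ones could default to the newest entry, which is what would allow an incremental transition.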

The rules for standards like BCP47 (the system that defines the name
of an ICU/CLDR locale) are deliberately very tolerant of what they
accept in order to ensure forwards and backwards compatibility in
environments where there isn't just one ICU/CLDR version [1] (most
environments in the world of distributed or web applications). So you
can expect the BCP47 name of a collation to more or less work on any
ICU version, perhaps with some loss of functionality (this is
unavoidable when you downgrade ICU to a version that doesn't have
whatever CLDR customization you might have relied on). It's very
intentionally a "best effort" approach, because throwing a "locale not
found" error message usually isn't helpful from the point of view of
the end user. Note that this is a broader standard than ICU or CLDR or
even Unicode.

[1] https://www.ietf.org/rfc/rfc6067.txt

--
Peter Geoghegan

#29

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Geoghegan (#20)

Re: Collation version tracking for macOS

On Wed, Jun 8, 2022 at 8:16 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jun 6, 2022 at 5:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Earlier I mentioned distinct "providers" but I take that back, that's
too complicated. Reprising an old idea that comes up each time we
talk about this, this time with some more straw-man detail: what about
teaching our ICU support to understand "libicui18n.so.71:en" to mean
that it should dlopen() that library and use its functions? Or some
cleverer, shorter notation. Then it's the user's problem to make sure
the right libraries are installed, and it'll fail if they're not. For
example, on Debian bookworm right now you can install libicu63,
libicu67, libicu71, though only the "current" -dev package, but which
I'm sure we can cope with. You're at the mercy of the distro or
add-on package repos to keep a lot of versions around, but that seems
OK.

Right. Postgres could link to multiple versions of ICU at the same
time. Right now it doesn't, and right now the ICU C symbol names that
we use are actually versioned (this isn't immediately apparent because
the C preprocessor makes it appear that ICU symbol names are generic).

Yeah, it's possible to link against multiple versions in theory and
that might be a way to do it if we were shipping our own N copies of
ICU like DB2 does, but that's hard in practice for shared libraries on
common distros (and vendoring or static linking of such libraries was
said to be against many distros' rules, since it would be a nightmare
if everyone did that, though I don't have a citation for that). I
suspect it's better to use dlopen() to load them, because (1) I
believe that the major distros only have -dev/-devel packages for the
"current" version, even though they let you install the packages
containing the .so files for multiple versions at the same time so
that binaries linked against older versions keep working and (2) I
think it'd be cool if users were free to find more ICU versions in
add-on package repos and be able to use them to get a version that the
packager of PostgreSQL didn't anticipate.

We could perhaps invent a new indirection that knows about
multiple ICU versions, each of which is an independent collation
provider, or maybe a related collation provider that gets used by
default on REINDEX. ICU is designed for this kind of thing. That
approach more or less puts packagers on the hook for managing
collation stability. But now long term collation stability is at least
feasible -- we at least have a coherent strategy. In the worst case
the community .deb and .rpm repos might continue to support an older
ICU version, or lobby for its continued support by the distro (while
actively discouraging its use in new databases). This isn't the same
thing as forking ICU. It's a compromise between that extreme, and
the current situation.

Yeah, I've flip-flopped a couple of times on the question of whether
ICU63 and ICU67 should be different collation providers, or
individual collations should somehow specify the library they want to
use (admittedly what I showed above with a raw library name is pretty
ugly and some indirection scheme might be nice). It would be good to
drill into the pros and cons of those two choices. As for getting
sane defaults, I don't know if this is a good idea, but it's an idea:
perhaps schemas and search paths could be used, you avoid having to
include ugly version strings in the collation identifiers, and the
search path effectively controls default when you don't want to be
explicit (= most users)?

#30

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#29)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 3:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Yeah, it's possible to link against multiple versions in theory and
that might be a way to do it if we were shipping our own N copies of
ICU like DB2 does, but that's hard in practice for shared libraries on
common distros (and vendoring or static linking of such libraries was
said to be against many distros' rules, since it would be a nightmare
if everyone did that, though I don't have a citation for that).

I'm not saying that it's going to be easy, but I can't see why it
should be impossible. I use Debian unstable for most of my work. It
supports multiple versions of LLVM/clang, not just one (though there
is a virtual package with a default version, I believe). What's the
difference, really?

Packaging standards certainly matter, but they're not immutable laws
of the universe. It seems reasonable to suppose that the people that
define these standards would be willing to hear us out -- this is
hardly a trifling matter, or something that only affects a small
minority of *their* users.

We don't need to support a huge number of versions on each OS -- just
enough to make it feasible for everybody to avoid the need to ever
reindex every index on a collatable type (maybe ICU versions that were
the default for the last several major versions of the OS are
available through special packages). We don't necessarily have to have
a hard dependency on every supported version from the point of view of
the package manager. And all of this would ultimately be the
responsibility of each individual packager; they'd need to figure out
how to make it work within the context of the platform that they're
targeting. We'd facilitate that important work, but would defer to
them on the final details. There could be a hands-off approach to the
whole thing, so it wouldn't be a total departure from what we do
today.

Yeah, I've flip-flopped a couple of times on the question of whether
ICU63 and ICU67 should be different collation providers, or
individual collations should somehow specify the library they want to
use (admittedly what I showed above with a raw library name is pretty
ugly and some indirection scheme might be nice). It would be good to
drill into the pros and cons of those two choices.

I think that there are pretty good technical reasons why each ICU
version is tied to a particular version of CLDR. Implementing CLDR
correctly and efficiently is a rather difficult process, even if we
ignore figuring out what natural language rules make sense. And so
linking to multiple different ICU versions doesn't really seem like
overkill to me. Or if it is then I can easily think of far better
examples of software bloat. Defining "stable behavior for collations"
as "uses exactly the same software artifact over time" is defensive
(compared to always linking to one ICU version that does it all), but
we have plenty that we need to defend against here.

--
Peter Geoghegan

#31

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Geoghegan (#30)

Re: Collation version tracking for macOS

On Wed, Jun 8, 2022 at 10:59 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Jun 7, 2022 at 3:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Yeah, it's possible to link against multiple versions in theory and
that might be a way to do it if we were shipping our own N copies of
ICU like DB2 does, but that's hard in practice for shared libraries on
common distros (and vendoring or static linking of such libraries was
said to be against many distros' rules, since it would be a nightmare
if everyone did that, though I don't have a citation for that).

I'm not saying that it's going to be easy, but I can't see why it
should be impossible. I use Debian unstable for most of my work. It
supports multiple versions of LLVM/clang, not just one (though there
is a virtual package with a default version, I believe). What's the
difference, really?

The difference is that Debian has libllvm-{11,12,13,14}-dev packages,
but it does *not* have multiple -dev packages for libicu, just a
single libicu-dev which can be used to compile and link against their
chosen current library version. They do have multiple packages for
the actual .so and allow them to be installed concurrently.
Therefore, you could install N .sos and dlopen() them, but you *can't*
write a program that compiles and links against N versions at the same
time using their packages (despite IBM's work to make that possible,
perhaps for use in their own databases).

Packaging standards certainly matter, but they're not immutable laws
of the universe. It seems reasonable to suppose that the people that
define these standards would be willing to hear us out -- this is
hardly a trifling matter, or something that only affects a small
minority of *their* users.

OK, yeah, I'm thinking within the confines of things we can do easily
right now on existing systems as they are currently packaging software
only by changing our code, not "tell Debian to change their packaging
so we can compile and link against N versions". Supposing Debian
maintainers (and all the others) agreed, there'd still be something else
in favour of dlopen(): wouldn't it be nice if the users were not
limited by the versions that the packager of PostgreSQL decided to
link against? What if someone has a good reason to want to use ICU
versions that are older than Debian currently ships, that are easily
available in add-on repos?

Yeah, I've flip-flopped a couple of times on the question of whether
ICU63 and ICU67 should be different collation providers, or
individual collations should somehow specify the library they want to
use (admittedly what I showed above with a raw library name is pretty
ugly and some indirection scheme might be nice). It would be good to
drill into the pros and cons of those two choices.

I think that there are pretty good technical reasons why each ICU
version is tied to a particular version of CLDR. Implementing CLDR
correctly and efficiently is a rather difficult process, even if we
ignore figuring out what natural language rules make sense. And so
linking to multiple different ICU versions doesn't really seem like
overkill to me. Or if it is then I can easily think of far better
examples of software bloat. Defining "stable behavior for collations"
as "uses exactly the same software artifact over time" is defensive
(compared to always linking to one ICU version that does it all), but
we have plenty that we need to defend against here.

I think we're not understanding each other here: I was talking about
the technical choice of whether we'd model the multiple library
versions in our catalogues as different "collprovider" values, or
somehow encode them into the "collcollate" string, or something else.
I'm with you, I'm already sold on the multi-library concept (and have
been in several previous cycles of this recurring discussion), which
is why I'm trying to move to discussing nuts and bolts and packaging
and linking realities that apparently stopped any prototype from
appearing last time around.

#32

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#31)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 4:29 PM Thomas Munro <thomas.munro@gmail.com> wrote:

The difference is that Debian has libllvm-{11,12,13,14}-dev packages,
but it does *not* have multiple -dev packages for libicu, just a
single libicu-dev which can be used to compile and link against their
chosen current library version. They do have multiple packages for
the actual .so and allow them to be installed concurrently.
Therefore, you could install N .sos and dlopen() them, but you *can't*
write a program that compiles and links against N versions at the same
time using their packages (despite IBM's work to make that possible,
perhaps for use in their own databases).

I know that glibc has various facilities for versioning dynamic
libraries, which includes ways to control symbol visibility. It's
possible that IBM's work on ICU versioning didn't just build on a
generic facility like that because that approach wasn't sufficiently
portable, particularly with platforms like AIX. It's also possible
that we won't have any of these same requirements, and can feasibly
link against multiple ICU versions some other way, and ultimately
achieve the same result -- multiple versions of ICU that can be used
by Postgres at the same time, with long term stable collations across
major OS and Postgres versions.

I now understand that you agree with me on this basic and important
point. Must have been a miscommunication.

Packaging standards certainly matter, but they're not immutable laws
of the universe. It seems reasonable to suppose that the people that
define these standards would be willing to hear us out -- this is
hardly a trifling matter, or something that only affects a small
minority of *their* users.

OK, yeah, I'm thinking within the confines of things we can do easily
right now on existing systems as they are currently packaging software
only by changing our code, not "tell Debian to change their packaging
so we can compile and link against N versions".

There are lots of specifics here, and I'm certainly not an expert on
packaging. IMV our approach doesn't necessarily need to use the same
original canonical package, though. It just needs to provide a
reasonably smooth experience for users that actually need to keep
their old collations working on upgrade. Either way, the process needs
to be something where all parties understand the concerns of each
other.

Of course Debian doesn't support linking against multiple versions of
ICU right now; why would they? Is there any reason to think that even
one person ever asked about it? Our interest in doing that will
probably be totally unique from their point of view. Can we just ask
somebody about it that has a deep understanding of these things?

Supposing Debian
maintainers (and all the others) agreed, there'd still be something else
in favour of dlopen(): wouldn't it be nice if the users were not
limited by the versions that the packager of PostgreSQL decided to
link against? What if someone has a good reason to want to use ICU
versions that are older than Debian currently ships, that are easily
available in add-on repos?

I don't consider the ability to support many versions of ICU for the
sake of ICU features to be much of an advantage. I mostly just care
about the simple, common case where a user upgrades and doesn't want
to REINDEX immediately. You may well be right about dlopen(); I just
don't know right now.

I think we're not understanding each other here: I was talking about
the technical choice of whether we'd model the multiple library
versions in our catalogues as different "collprovider" values, or
somehow encode them into the "collcollate" string, or something else.

ISTM that there are two mostly-distinct questions here:

1. How do we link to multiple versions of ICU at the same time, in a
way that is going to work smoothly on mainstream platforms?

2. What semantics around collations do we want for Postgres once we
gain the ability to use multiple versions of ICU at the same time? For
example, do we want to generalize the definition of a collation, so
that it's associated with one particular ICU version and collation for
the purposes of on-disk compatibility, but isn't necessarily tied to
the same ICU version in other contexts, such as on a dump and restore?

--
Peter Geoghegan

#33

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Geoghegan (#32)

1 attachment(s)

Re: Collation version tracking for macOS

On Wed, Jun 8, 2022 at 12:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

ISTM that there are two mostly-distinct questions here:

1. How do we link to multiple versions of ICU at the same time, in a
way that is going to work smoothly on mainstream platforms?

2. What semantics around collations do we want for Postgres once we
gain the ability to use multiple versions of ICU at the same time? For
example, do we want to generalize the definition of a collation, so
that it's associated with one particular ICU version and collation for
the purposes of on-disk compatibility, but isn't necessarily tied to
the same ICU version in other contexts, such as on a dump and restore?

Yeah. Well I couldn't resist doing some (very!) experimental hacking.
See attached. The idea of putting a raw library name in there is just
a straw-man, and I already found a major problem with it: I also need
to get my hands on u_strToLower and friends for formatting.c, but
those functions are in a different library that needs to be dlopen'd
separately, so we need *two* names. That's not done in the attached
patch, but at least this demonstrates some of the mechanics of a
dlopen() based solution that can do the collating part... of course
there are all kinds of problems apparent (security of loading
arbitrary libraries, API stability, interaction with the "default" ICU
that our binary is linked against, creation of initial set of
collations in initdb, naming, upgrades, ...).

Demo:

$ sudo apt-get install libicu63 libicu67

postgres=# create schema icu63;
CREATE SCHEMA
postgres=# create schema icu67;
CREATE SCHEMA
postgres=# create collation icu63."en-US-x-icu" (provider = icu,
locale = 'libicui18n.so.63:en-US');
CREATE COLLATION
postgres=# create collation icu67."en-US-x-icu" (provider = icu,
locale = 'libicui18n.so.67:en-US');
CREATE COLLATION
postgres=# select collname, collnamespace::regnamespace,
colliculocale, collversion
from pg_collation
where collname = 'en-US-x-icu';
  collname   | collnamespace |     colliculocale      | collversion
-------------+---------------+------------------------+-------------
 en-US-x-icu | pg_catalog    | en-US                  | 153.14
 en-US-x-icu | icu63         | libicui18n.so.63:en-US | 153.88
 en-US-x-icu | icu67         | libicui18n.so.67:en-US | 153.14
(3 rows)

postgres=# select relname from pg_class order by relname collate
icu63."en-US-x-icu" limit 2;
          relname
---------------------------
 _pg_foreign_data_wrappers
 _pg_foreign_servers
(2 rows)
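
For anyone reading along without the patch applied, the "library:locale"
convention used in the demo can be sketched in isolation. This is only an
illustration and not code from the attached patch: the helper name
split_icu_locale_spec and its too_long flag are made up here, and unlike
the strncpy() in the WIP patch it refuses a prefix that does not fit the
buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the attached patch): split a string of
 * the form "libname:locale" into its two parts.
 *
 * Returns true and fills "lib" if a library prefix was present and fit in
 * the buffer.  Returns false and leaves "lib" empty if there was no prefix
 * or the prefix was too long (reported through *too_long).  *rest always
 * points at the locale part inside "spec".
 */
static bool
split_icu_locale_spec(const char *spec, char *lib, size_t libsize,
                      const char **rest, bool *too_long)
{
    const char *sep;
    size_t      len;

    assert(spec != NULL && lib != NULL && libsize > 0);

    lib[0] = '\0';
    *too_long = false;

    sep = strchr(spec, ':');
    if (sep == NULL)
    {
        /* Traditional value without a prefix: whole string is the locale. */
        *rest = spec;
        return false;
    }

    *rest = sep + 1;
    len = (size_t) (sep - spec);
    if (len >= libsize)
    {
        /* Refuse to truncate or overflow; let the caller raise an error. */
        *too_long = true;
        return false;
    }

    memcpy(lib, spec, len);
    lib[len] = '\0';
    return true;
}
```

With "libicui18n.so.63:en-US" this yields the library name
"libicui18n.so.63" and the locale "en-US". A plain "en-US" has no prefix,
which the patch treats as a request for the ICU library that the binary
was linked against.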

Attachments:

0001-WIP-allow-multiple-ICU-libraries.patch (text/x-patch; charset=US-ASCII)

From 5622f25172881e021d0f436add8a785f9e3445e5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH] WIP: allow multiple ICU libraries

XXX This is highly experimental code
---
 src/backend/access/hash/hashfunc.c |  16 +--
 src/backend/utils/adt/pg_locale.c  | 209 +++++++++++++++++++++++++++--
 src/backend/utils/adt/varchar.c    |  16 +--
 src/backend/utils/adt/varlena.c    |  47 +++----
 src/include/utils/pg_locale.h      |  47 +++++++
 5 files changed, 284 insertions(+), 51 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..c1847149de 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,11 @@ hashtext(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = mylocale->info.icu.funcs->getSortKey(mylocale->info.icu.ucol,
+														 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			mylocale->info.icu.funcs->getSortKey(mylocale->info.icu.ucol,
+												 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -355,11 +355,11 @@ hashtextextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = mylocale->info.icu.funcs->getSortKey(mylocale->info.icu.ucol,
+														 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			mylocale->info.icu.funcs->getSortKey(mylocale->info.icu.ucol,
+												 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index a0490a7522..3a8951fe46 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -78,6 +78,10 @@
 #include <shlwapi.h>
 #endif
 
+#ifdef HAVE_DLOPEN
+#include <dlfcn.h>
+#endif
+
 #define		MAX_L10N_DATA		80
 
 
@@ -1435,29 +1439,204 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
+#ifdef USE_ICU
+
 struct pg_locale_struct default_locale;
 
+/* Linked list of ICU libraries we have loaded. */
+static pg_icu_library *icu_library_list = NULL;
+
+static void
+free_icu_library(pg_icu_library *l)
+{
+	if (l->handle)
+		dlclose(l->handle);
+	if (l->name)
+		pfree(l->name);
+	pfree(l);
+}
+
+static void *
+get_icu_function(void *handle, const char *function, int version)
+{
+	char name[80];
+
+	snprintf(name, sizeof(name), "%s_%d", function, version);
+
+	return dlsym(handle, name);
+}
+
+static int
+get_icu_library_version(const char *name, void *handle)
+{
+	/*
+	 * Probe until we find the suffix being used.  Since we don't want to use
+	 * libraries that are newer than the one we are compiled against (newer
+	 * libraries might make incompatible API changes), we'll search from a
+	 * known good ancient version number up to the version we were compiled
+	 * with.
+	 */
+	for (int i = 54; i <= U_ICU_VERSION_MAJOR_NUM; ++i)
+		if (get_icu_function(handle, "ucol_open", i))
+			return i;
+
+	/* Either it's a later version we don't dare use, or not ICU. */
+	return -1;
+}
+
+/*
+ * Given a library name, return the object we need to call its functions.
+ */
+static pg_icu_library *
+get_icu_library(const char *name)
+{
+	pg_icu_library *l;
+
+	/* Try to find it in our list of existing libraries. */
+	for (l = icu_library_list; l; l = l->next)
+	{
+		if (name == NULL && l->name == NULL)
+			return l;
+		if (name != NULL && l->name != NULL && strcmp(name, l->name) == 0)
+			return l;
+	}
+
+	/* Make a new entry. */
+	l = MemoryContextAllocZero(TopMemoryContext, sizeof(*l));
+	if (name != NULL) {
+#ifdef HAVE_DLOPEN
+		int version;
+
+		l->name = MemoryContextStrdup(TopMemoryContext, name);
+		l->handle = dlopen(name, RTLD_NOW | RTLD_GLOBAL);
+		if (l->handle == NULL)
+		{
+			int errno_save = errno;
+			free_icu_library(l);
+			errno = errno_save;
+
+			ereport(ERROR,
+					(errmsg("could not load library \"%s\": %m", name)));
+		}
+
+		/* Figure out which major version this is. */
+		version = get_icu_library_version(name, l->handle);
+		if (version < 0)
+		{
+			free_icu_library(l);
+			ereport(ERROR,
+					(errmsg("could not find compatible ICU version in library \"%s\"",
+							name)));
+		}
+		l->version = version;
+
+		/* Look up all the functions we need. */
+		l->open = get_icu_function(l->handle, "ucol_open", version);
+		l->close = get_icu_function(l->handle, "ucol_close", version);
+		l->getVersion = get_icu_function(l->handle, "ucol_getVersion", version);
+		l->versionToString = get_icu_function(l->handle, "u_versionToString",
+											  version);
+		l->strcoll = get_icu_function(l->handle, "ucol_strcoll", version);
+		l->strcollUTF8 = get_icu_function(l->handle, "ucol_strcollUTF8",
+										  version);
+		l->getSortKey = get_icu_function(l->handle, "ucol_getSortKey",
+										 version);
+		l->nextSortKeyPart = get_icu_function(l->handle, "ucol_nextSortKeyPart",
+											  version);
+		l->errorName = get_icu_function(l->handle, "u_errorName", version);
+		if (!l->open ||
+			!l->close ||
+			!l->getVersion ||
+			!l->versionToString ||
+			!l->strcoll ||
+			!l->strcollUTF8 ||
+			!l->getSortKey ||
+			!l->nextSortKeyPart ||
+			!l->errorName)
+		{
+			free_icu_library(l);
+			ereport(ERROR,
+					(errmsg("could not find expected symbols in library \"%s\"",
+							name)));
+		}
+#else
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("loading extra ICU libraries is not supported in this build")));
+#endif
+	}
+	else
+	{
+		/* Use the library we were compiled and linked with. */
+		l->open = ucol_open;
+		l->close = ucol_close;
+		l->getVersion = ucol_getVersion;
+		l->versionToString = u_versionToString;
+		l->strcoll = ucol_strcoll;
+		l->strcollUTF8 = ucol_strcollUTF8;
+		l->getSortKey = ucol_getSortKey;
+		l->nextSortKeyPart = ucol_nextSortKeyPart;
+		l->errorName = u_errorName;
+	}
+	l->next = icu_library_list;
+	icu_library_list = l;
+
+	return l;
+}
+
+/*
+ * Look up the library to use for a given collcollate string.
+ */
+static pg_icu_library *
+get_icu_library_for_collation(const char *collcollate, const char **rest)
+{
+	char prefix[MAXPGPATH];
+	char *separator;
+	size_t prefix_len;
+
+	separator = strchr(collcollate, ':');
+
+	/* If it's a traditional value without a prefix, use default library. */
+	if (separator == NULL)
+	{
+		*rest = collcollate;
+		return get_icu_library(NULL);
+	}
+
+	/* If it has a prefix, we look that library up. */
+	prefix_len = separator - collcollate;
+	strncpy(prefix, collcollate, prefix_len);
+	prefix[prefix_len] = 0;
+	*rest = separator + 1;
+	return get_icu_library(prefix);
+}
+
+#endif
+
 void
 make_icu_collator(const char *iculocstr,
 				  struct pg_locale_struct *resultp)
 {
 #ifdef USE_ICU
+	pg_icu_library *l;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	l = get_icu_library_for_collation(iculocstr, &iculocstr);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
+	collator = l->open(iculocstr, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+						iculocstr, l->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	if (l->version < 54)
 		icu_set_collation_attributes(collator, iculocstr);
 
 	/* We will leak this string if the caller errors later :-( */
 	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
 	resultp->info.icu.ucol = collator;
+	resultp->info.icu.funcs = l;
 #else							/* not USE_ICU */
 	/* could get here if a collation was created by a build with ICU */
 	ereport(ERROR,
@@ -1688,21 +1867,23 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 #ifdef USE_ICU
 	if (collprovider == COLLPROVIDER_ICU)
 	{
+		pg_icu_library *l;
 		UCollator  *collator;
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
+		l = get_icu_library_for_collation(collcollate, &collcollate);
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = l->open(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, l->errorName(status))));
+		l->getVersion(collator, versioninfo);
+		l->close(collator);
 
-		u_versionToString(versioninfo, buf);
+		l->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -1770,6 +1951,8 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 
 
 #ifdef USE_ICU
+
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -1991,19 +2174,21 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
+	pg_icu_library *l;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	l = get_icu_library_for_collation(icu_locale, &icu_locale);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = l->open(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, l->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	if (l->version < 54)
 		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+	l->close(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index bbeb0a2653..4815275ff5 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1025,11 +1025,11 @@ hashbpchar(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, keydata, keylen);
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = mylocale->info.icu.funcs->getSortKey(mylocale->info.icu.ucol,
+														 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			mylocale->info.icu.funcs->getSortKey(mylocale->info.icu.ucol,
+												 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -1086,11 +1086,11 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = mylocale->info.icu.funcs->getSortKey(mylocale->info.icu.ucol,
+														 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			mylocale->info.icu.funcs->getSortKey(mylocale->info.icu.ucol,
+												 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 919138eaf3..f933ec0de0 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1666,10 +1666,10 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					UErrorCode	status;
 
 					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
+					result = mylocale->info.icu.funcs->strcollUTF8(mylocale->info.icu.ucol,
+																   arg1, len1,
+																   arg2, len2,
+																   &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
 								(errmsg("collation failed: %s", u_errorName(status))));
@@ -1685,9 +1685,9 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
 					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
 
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
+					result = mylocale->info.icu.funcs->strcoll(mylocale->info.icu.ucol,
+															   uchar1, ulen1,
+															   uchar2, ulen2);
 
 					pfree(uchar1);
 					pfree(uchar2);
@@ -2389,10 +2389,10 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				UErrorCode	status;
 
 				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
+				result = sss->locale->info.icu.funcs->strcollUTF8(sss->locale->info.icu.ucol,
+																  a1p, len1,
+																  a2p, len2,
+																  &status);
 				if (U_FAILURE(status))
 					ereport(ERROR,
 							(errmsg("collation failed: %s", u_errorName(status))));
@@ -2408,9 +2408,9 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
 				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
 
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
+				result = sss->locale->info.icu.funcs->strcoll(sss->locale->info.icu.ucol,
+															  uchar1, ulen1,
+															  uchar2, ulen2);
 
 				pfree(uchar1);
 				pfree(uchar2);
@@ -2574,21 +2574,22 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 					uiter_setUTF8(&iter, sss->buf1, len);
 					state[0] = state[1] = 0;	/* won't need that again */
 					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
+					bsize =
+						sss->locale->info.icu.funcs->nextSortKeyPart(sss->locale->info.icu.ucol,
+																	 &iter,
+																	 state,
+																	 (uint8_t *) sss->buf2,
+																	 Min(sizeof(Datum), sss->buflen2),
+																	 &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
 								(errmsg("sort key generation failed: %s",
 										u_errorName(status))));
 				}
 				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+					bsize = sss->locale->info.icu.funcs->getSortKey(sss->locale->info.icu.ucol,
+																	uchar, ulen,
+																	(uint8_t *) sss->buf2, sss->buflen2);
 			}
 			else
 #endif
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index e7385faef8..da533fcb91 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -74,6 +74,52 @@ extern struct lconv *PGLC_localeconv(void);
 
 extern void cache_locale_time(void);
 
+#ifdef USE_ICU
+
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	char	   *name;			/* if loaded with dlopen() */
+	void	   *handle;			/* if loaded with dlopen() */
+	int			version;		/* major version of ICU */
+	UCollator *(*open)(const char *loc, UErrorCode *status);
+	void (*close)(UCollator *coll);
+	void (*getVersion)(const UCollator *coll, UVersionInfo info);
+	void (*versionToString)(const UVersionInfo versionArray,
+							char *versionString);
+	UCollationResult (*strcoll)(const UCollator *coll,
+								const UChar *source,
+								int32_t sourceLength,
+								const UChar *target,
+								int32_t targetLength);
+	UCollationResult (*strcollUTF8)(const UCollator *coll,
+									const char *source,
+									int32_t sourceLength,
+									const char *target,
+									int32_t targetLength,
+									UErrorCode *status);
+	int32_t (*getSortKey)(const UCollator *coll,
+						  const UChar *source,
+						  int32_t sourceLength,
+						  uint8_t *result,
+						  int32_t resultLength);
+	int32_t (*nextSortKeyPart)(const UCollator *coll,
+							   UCharIterator *iter,
+							   uint32_t state[2],
+							   uint8_t *dest,
+							   int32_t count,
+							   UErrorCode *status);
+	const char *(*errorName)(UErrorCode code);
+/*
+	UVersionInfo (*versioninfo)
+*/
+	struct pg_icu_library *next;
+} pg_icu_library;
+
+#endif
 
 /*
  * We define our own wrapper around locale_t so we can keep the same
@@ -95,6 +141,7 @@ struct pg_locale_struct
 		{
 			const char *locale;
 			UCollator  *ucol;
+			pg_icu_library *funcs;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
-- 
2.30.2

#34

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Tom Lane (#17)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

No, I quite agree that we have a problem. What I don't agree is that
issuing a lot of false-positive warnings is a solution. That will
just condition people to ignore the warnings, and then when their
platform really does change behavior, they're still screwed. If we
could *accurately* report collation behavioral changes, I'd be all
for that.

I mean, how many false-positive warnings do you think we'll get?

I would argue that if we put out something that's wrong half the time
-- it tells you about all the real problems and an equal number of
imaginary ones -- we'd be way ahead of where we are right now. If on
the other hand we put out something that's wrong 99% of the time -- it
tells you about all the real problems and ninety-nine times as many
imaginary ones -- that's worse than useless.

There can be some weasel wording in the language e.g. "WARNING: glibc
has been updated, collation definitions may have changed". It's worth
keeping in mind that the user doesn't necessarily have another source
of information that is more accurate than what we're providing. If
they REINDEX somewhat more often than is really necessary, that may be
painful, but it can still be a lot better than having queries return
wrong answers. If it's not, nobody's forcing them to issue that
REINDEX command.
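
To make the mechanics concrete: the check being argued about amounts to
comparing the version string recorded when the collation was created with
whatever the library reports now. A minimal sketch, with hypothetical
function names and message wording (this is not PostgreSQL's actual code,
and a NULL version here just means "not tracked"):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical check: true only if both versions are known and differ.
 * An untracked (NULL) version never triggers a warning.
 */
static bool
collation_version_mismatch(const char *recorded, const char *actual)
{
    if (recorded == NULL || actual == NULL)
        return false;
    return strcmp(recorded, actual) != 0;
}

/*
 * Hypothetical formatter for a deliberately hedged warning.  Writes an
 * empty string and returns 0 when there is nothing to warn about;
 * otherwise returns what snprintf() returns.
 */
static int
format_version_warning(char *buf, size_t bufsize, const char *collname,
                       const char *recorded, const char *actual)
{
    assert(buf != NULL && bufsize > 0 && collname != NULL);

    if (!collation_version_mismatch(recorded, actual))
    {
        buf[0] = '\0';
        return 0;
    }
    return snprintf(buf, bufsize,
                    "WARNING: collation \"%s\" was created with version %s, "
                    "but the library now reports %s; "
                    "collation definitions may have changed",
                    collname, recorded, actual);
}
```

The check itself cannot say whether any stored string is affected; it only
says that the provider's version string moved, which is why the wording
stays at "may have changed".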

--
Robert Haas
EDB: http://www.enterprisedb.com

#35

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Tom Lane (#19)

Re: Collation version tracking for macOS

On Tue, Jun 7, 2022 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

What I mean by "false positive" is telling every macOS user that they'd better
reindex everything every year, when in point of fact Apple changes those
collations almost never.

Do we actually know that to be true? Given how fast things seem to be
getting added to Unicode, it wouldn't surprise me at all if they're
updating their Unicode tables for new characters with some regularity,
if nothing else, and that's a breaking change for us.

--
Robert Haas
EDB: http://www.enterprisedb.com

#36

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Robert Haas (#34)

Re: Collation version tracking for macOS

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jun 7, 2022 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

No, I quite agree that we have a problem. What I don't agree is that
issuing a lot of false-positive warnings is a solution.

I mean, how many false-positive warnings do you think we'll get?

The proposed patch would result in a warning about every collation-
sensitive index during every macOS major version upgrade, ie about
once a year for most people. Seeing that Apple only actually touch
their POSIX collations once a decade or so, that's way too far over
on the crying-wolf end of the scale for me. We need something that
has at least *some* connection to actual changes.
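
As rough arithmetic on those numbers, assuming exactly one warning per
yearly macOS upgrade and one real collation change per decade (both are
the approximations above, not measurements), the fraction of warnings that
point at a real change can be computed as below. warning_precision is a
made-up name.

```c
#include <assert.h>

/*
 * Fraction of issued warnings that correspond to a real collation change.
 * Both counts cover the same period, e.g. one decade.
 */
static double
warning_precision(int real_changes, int warnings_issued)
{
    assert(warnings_issued > 0);
    assert(real_changes >= 0 && real_changes <= warnings_issued);

    return (double) real_changes / (double) warnings_issued;
}
```

warning_precision(1, 10) comes out at about 0.1, that is, roughly nine
warnings in ten would be false alarms under these assumptions. That sits
between the "wrong half the time" and "wrong 99% of the time" cases
discussed upthread.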

regards, tom lane

#37

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Robert Haas (#35)

Re: Collation version tracking for macOS

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jun 7, 2022 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

What I mean by "false positive" is telling every macOS user that they'd better
reindex everything every year, when in point of fact Apple changes those
collations almost never.

Do we actually know that to be true? Given how fast things seem to be
getting added to Unicode, it wouldn't surprise me at all if they're
updating their Unicode tables for new characters with some regularity,
if nothing else, and that's a breaking change for us.

Their POSIX collations seem to be legacy code that's entirely unrelated to
any modern collation support; in particular the "UTF8" ones are that in
name only. I'm sure that Apple are indeed updating the UTF8 data behind
their proprietary i18n APIs, but the libc APIs are mostly getting benign
neglect.

Maybe the report that started this thread indicates that this is changing,
but I'll believe that when I see it.

regards, tom lane

#38

Daniel Verite

daniel@manitou-mail.org

over 3 years ago

In reply to: Tom Lane (#16)

Re: Collation version tracking for macOS

Tom Lane wrote:

Yeah, and it's exactly at the level of quirks that things are likely
to change. Nobody's going to suddenly start sorting B before A.
They might, say, change their minds about where the digram "cz"
sorts relative to single letters, in languages where special rules
for that are a thing.

Independently of these rules, all Unicode collations change frequently
because each release of Unicode adds new characters. Any string
that contains a code point that was previously unassigned is going
to be sorted differently by all collations when that code point gets
assigned to a character.
Therefore the versions of all collations need to be bumped at every
Unicode release. This is what ICU does.

If the libc in macOS doesn't follow Unicode, that's not relevant
to macOS, but let's assume an OS that tries to be up-to-date.
If major OS upgrades happen every year or less frequently,
each OS upgrade is likely to imply an upgrade of all the collations,
since the interval between Unicode releases tends to be a year
or less:
https://www.unicode.org/history/publicationdates.html

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#39

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Tom Lane (#37)

Re: Collation version tracking for macOS

On Wed, Jun 8, 2022 at 10:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Their POSIX collations seem to be legacy code that's entirely unrelated to
any modern collation support; in particular the "UTF8" ones are that in
name only. I'm sure that Apple are indeed updating the UTF8 data behind
their proprietary i18n APIs, but the libc APIs are mostly getting benign
neglect.

I find that easy to believe. It's consistent with the overall picture
of Apple not caring about the POSIX collations beyond the basic
requirement for compatibility. ISTM that their totally inefficient
approach to implementing strxfrm() is another example of the same
thing. (The Apple strxfrm() produces huge low entropy binary strings,
unlike the glibc version, which is pretty well optimized.)
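
A quick way to check this on a given platform is to ask strxfrm() how
large the transformed string would be, which standard C allows by passing
a zero-length buffer. A minimal sketch (the helper names are made up here,
and no particular numbers are claimed for any platform):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Number of bytes strxfrm() needs for "s", excluding the terminating NUL,
 * in the current LC_COLLATE locale.  With n == 0 the destination pointer
 * may be NULL and nothing is written.
 */
static size_t
strxfrm_blob_size(const char *s)
{
    assert(s != NULL);
    return strxfrm(NULL, s, 0);
}

/* Transformed size divided by input size; the input must be non-empty. */
static double
strxfrm_expansion(const char *s)
{
    size_t      n = strlen(s);

    assert(n > 0);
    return (double) strxfrm_blob_size(s) / (double) n;
}
```

Calling setlocale(LC_COLLATE, ...) with a UTF-8 locale that is installed
on both systems, and then comparing strxfrm_expansion() for the same
string on macOS and on a glibc system, should show the difference in blob
size.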

--
Peter Geoghegan

#40

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Daniel Verite (#38)

Re: Collation version tracking for macOS

"Daniel Verite" <daniel@manitou-mail.org> writes:

Independently of these rules, all Unicode collations change frequently
because each release of Unicode adds new characters. Any string
that contains a code point that was previously unassigned is going
to be sorted differently by all collations when that code point gets
assigned to a character.
Therefore the versions of all collations need to be bumped at every
Unicode release. This is what ICU does.

I'm very skeptical of this process as being a reason to push users
to reindex everything in sight. If U+NNNN was not a thing last year,
there's no reason to expect that it appears in anyone's existing data,
and therefore the fact that it sorts differently this year is a poor
excuse for sounding time-to-reindex alarm bells.

I'm quite concerned that we are going to be training users to ignore
collation-change warnings. They have got to be a lot better targeted
than this, or we're just wasting everyone's time, including ours.

regards, tom lane

#41

Robert Haas

robertmhaas@gmail.com

over 3 years ago

In reply to: Tom Lane (#40)

Re: Collation version tracking for macOS

On Wed, Jun 8, 2022 at 4:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm very skeptical of this process as being a reason to push users
to reindex everything in sight. If U+NNNN was not a thing last year,
there's no reason to expect that it appears in anyone's existing data,
and therefore the fact that it sorts differently this year is a poor
excuse for sounding time-to-reindex alarm bells.

That seems completely wrong to me. It's not like a new character shows
up and people wait to start using it until it makes its way into
everyone's collation data. That is emphatically not what happens, I
would say. What happens is that people upgrade their libc packages at
one time and their postgres packages at another time, and it's
unlikely that they have any idea which order they do or did those
things. Meanwhile, people start using all the latest emojis. The idea
that the average PostgreSQL user has any idea whether a certain emoji
shows up in the data set for the first time before or after they
install the libc version that knows about it seems absurd. We don't
even know how to figure out which emojis the installed libc supports
-- if we did, we could reject data that we don't know how to sort
properly instead of ending up with corrupted indexes later. The user
has no more ability to figure it out than we do, and even if they did,
they probably wouldn't want to compare their stream of input data to
their collate definitions using some process external to the database.

--
Robert Haas
EDB: http://www.enterprisedb.com

#42

Mark Dilger

mark.dilger@enterprisedb.com

over 3 years ago

In reply to: Tom Lane (#19)

Re: Collation version tracking for macOS

On Jun 7, 2022, at 1:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

This is not the concern that I have. I agree that if we tell a user
that collation X changed behavior and he'd better reindex his indexes
that use collation X, but none of them actually contain any cases that
changed behavior, that's not a "false positive" --- that's "it's cheaper
to reindex than to try to identify whether there's a problem".

I don't see this problem as limited to indexes, though I do understand why that might be the most common place for the problem to manifest itself.

As a simple example, text[] constructed using array_agg over sorted data can be corrupted by a collation change, and reindex won't fix it.

If we extend the table-AM interface to allow query quals to be pushed down to the table-AM, we might develop table-AMs that care about sort order, too.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#43

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Tom Lane (#37)

Re: Collation version tracking for macOS

On Thu, Jun 9, 2022 at 5:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm sure that Apple are indeed updating the UTF8 data behind
their proprietary i18n APIs, but the libc APIs are mostly getting benign
neglect.

As for how exactly they might be doing that, I don't know, but a bit
of light googling tells me that a private, headerless,
please-don't-call-me-directly copy of ICU arrived back in macOS
10.3[1]. I don't see it on my 12.4 system, but I also know that 12.x
started hiding system libraries completely (the linker is magic and
pulls libraries from some parallel dimension, there is no
/usr/lib/libSystem.B.dylib file on disk, and yet otool -L
<your_favourite_executable> references it).

It's a lovely client machine, but I don't know if anyone really runs
meaningful database server stuff on macOS. I think if I did I'd be
very keen to use ICU for everything directly, rather than trying to
unpick any of that and talk to Apple's API... I think the
how-to-support-multiple-ICUs subrant/subthread is a much more
interesting topic. I have no idea if the dlopen() concept I mentioned
is the right way forward, but FWIW the experimental patch I posted
seems to work just fine on a Mac, using multiple ICU libraries
installed by MacPorts, which might be useful to developers
contemplating that stuff.

[1]: https://lists.apple.com/archives/xcode-users/2005/Jun/msg00633.html

#44

Jeremy Schneider

schneider@ardentperf.com

over 3 years ago

In reply to: Thomas Munro (#43)

Re: Collation version tracking for macOS

New emoji are getting added with some frequency, it’s a thing lately…

New Unicode chars use existing but unassigned code points. All code points are able to be encoded, claimed or unclaimed.

Someone on old glibc or ICU can still store the new characters. As long as there’s an input field. You wouldn’t believe some stuff I’ve seen people enter in the “name” field for web apps… 🙄 It’ll get some undefined or default sort behavior for unrecognized or unassigned code points.

When the libs are updated, those new chars begin to sort correctly, which is a change and breaks indexes (and potentially other stuff).
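
A toy model of that breakage, as a sketch only: keys are single code points, the "old library" sorts an unknown code point after everything it knows, and the "new library" gives it a real weight. The weights, the chosen code point and all function names are invented for illustration; real ICU or glibc behavior is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary stand-in for a code point that was assigned only recently. */
#define NEW_CP 0x1FAE0u

typedef uint32_t (*weight_fn) (uint32_t cp);

/* Old library: NEW_CP is unknown and sorts after every known character. */
static uint32_t
weight_old(uint32_t cp)
{
    return cp == NEW_CP ? UINT32_MAX : cp;
}

/* New library: NEW_CP now has a weight that places it before 'a'. */
static uint32_t
weight_new(uint32_t cp)
{
    return cp == NEW_CP ? 0x60u : cp;
}

/*
 * Binary search over keys that were sorted under some weight function,
 * standing in for a btree lookup.  It only works if "w" orders the keys
 * the same way as the function they were sorted with.
 */
static bool
index_contains(const uint32_t *keys, size_t n, uint32_t probe, weight_fn w)
{
    size_t      lo = 0;
    size_t      hi = n;

    assert(keys != NULL || n == 0);

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (w(keys[mid]) == w(probe))
            return true;
        if (w(keys[mid]) < w(probe))
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}
```

Take keys stored as {'a', 'b', 'c', 'd', NEW_CP}, which is sorted order under weight_old. A lookup of NEW_CP with weight_old finds it, but the same lookup with weight_new walks left and misses a key that is physically present. That is the "index corruption" in miniature: nothing on disk changed, only the comparator did.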

-Jeremy

Sent from my TI-83

On Jun 8, 2022, at 16:34, Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Jun 9, 2022 at 5:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm sure that Apple are indeed updating the UTF8 data behind
their proprietary i18n APIs, but the libc APIs are mostly getting benign
neglect.

As for how exactly they might be doing that, I don't know, but a bit
of light googling tells me that a private, headerless,
please-don't-call-me-directly copy of ICU arrived back in macOS
10.3[1]. I don't see it on my 12.4 system, but I also know that 12.x
started hiding system libraries completely (the linker is magic and
pulls libraries from some parallel dimension, there is no
/usr/lib/libSystem.B.dylib file on disk, and yet otool -L
<your_favourite_executable> references it).

It's a lovely client machine, but I don't know if anyone really runs
meaningful database server stuff on macOS. I think if I did I'd be
very keen to use ICU for everything directly, rather than trying to
unpick any of that and talk to Apple's API... I think the
how-to-support-multiple-ICUs subrant/subthread is a much more
interesting topic. I have no idea if the dlopen() concept I mentioned
is the right way forward, but FWIW the experimental patch I posted
seems to work just fine on a Mac, using multiple ICU libraries
installed by MacPorts, which might be useful to developers
contemplating that stuff.

[1] https://lists.apple.com/archives/xcode-users/2005/Jun/msg00633.html

#45

Jeremy Schneider

schneider@ardentperf.com

over 3 years ago

In reply to: Thomas Munro (#33)

Re: Collation version tracking for macOS

On Jun 8, 2022, at 03:19, Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Jun 8, 2022 at 12:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

ISTM that there are two mostly-distinct questions here:

1. How do we link to multiple versions of ICU at the same time, in a
way that is going to work smoothly on mainstream platforms?

Yeah. Well I couldn't resist doing some (very!) experimental hacking.
See attached.

Even if PG supports two versions of ICU, how does someone actually go about removing every dependency on the old version and replacing it with the new? Can it be done without downtime? Can it be done without modifying a running application? Avoiding “collate” clauses on SQL statements requires working behind the scenes with defaults and indexes and partitions and constraints and everything else. I’m having a hard time coming up with a way this would be possible in practice, with all the places collations can show up.

Is the idea of “alter database” to change the default collation even realistic?

I’m having a bit of trouble picturing what the end game is here

-Jeremy

Sent from my TI-83

#46

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Jeremy Schneider (#45)

Re: Collation version tracking for macOS

On Wed, Jun 8, 2022 at 10:24 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:

Even if PG supports two versions of ICU, how does someone actually go about removing every dependency on the old version and replacing it with the new?

They simply REINDEX, without changing anything. The details are still
fuzzy, but at least that's what I was thinking of.

This should be possible by generalizing the definition of a collation
to recognize that different ICU versions can support the same
collation. Of course we'd also have to remember which actual ICU
version and specific "physical collation" was currently in use by each
index. We'd also probably have to have some policy about which ICU
version was the latest (or some suitably generalized version of that
that applies to collation providers more generally).

Can it be done without downtime? Can it be done without modifying a running application?

Clearly the only way that we can ever transition to a new "physical
collation" is by reindexing using a newer ICU version. And clearly
there is going to be a need to fully deprecate any legacy version of
ICU on a long enough timeline. There is just no getting around that.

The advantage of an approach along the lines that I've laid out is
that everything can be done incrementally, possibly some time after an
initial OS or Postgres upgrade, once everything has settled. Much much
later, even. If the same new ICU version isn't available in your
original/old environment (which is likely), you can avoid reindexing,
and so reserve the option of backing out of a complex upgrade until
very late in the process. You're going to have to do it eventually,
but it can probably just be an afterthought.

--
Peter Geoghegan

#47

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Peter Geoghegan (#46)

Re: Collation version tracking for macOS

On Wed, Jun 8, 2022 at 10:39 PM Peter Geoghegan <pg@bowt.ie> wrote:

They simply REINDEX, without changing anything. The details are still
fuzzy, but at least that's what I was thinking of.

As I said before, BCP47 format tags are incredibly forgiving by
design. So it should be reasonable to assume that anything that has
worked in an earlier version of ICU will continue to work in a way
that's at least as useful in a future version. See:

/messages/by-id/CAH2-Wz=ZrA5Yf55pKtdJb2pYCVN=2dh__VGR9arQqOHMqWgQPg@mail.gmail.com

That's not strictly guaranteed, because sometimes countries cease to
exist, and their ISO country codes eventually go away too. But that
still tends to fail gracefully. It's mostly only relevant for things
that are part of a locale, which is a broader concept than just
collation. An application that did this and relied on ICU for
localization might then find that the currency sign changed, but I'm
not aware of any impact on locales. You can ask for total nonsense
and mostly get reasonable behaviors, like Japanese as spoken in
Iceland. Even some totally made up (or misspelled) country is
accepted without complaint.

--
Peter Geoghegan

#48

Jeremy Schneider

schneider@ardentperf.com

over 3 years ago

In reply to: Peter Geoghegan (#46)

Re: Collation version tracking for macOS

On Jun 8, 2022, at 22:40, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jun 8, 2022 at 10:24 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:

Even if PG supports two versions of ICU, how does someone actually go about removing every dependency on the old version and replacing it with the new?

They simply REINDEX, without changing anything. The details are still
fuzzy, but at least that's what I was thinking of.

Can it be done without downtime? Can it be done without modifying a running application?

Clearly the only way that we can ever transition to a new "physical
collation" is by reindexing using a newer ICU version. And clearly
there is going to be a need to fully deprecate any legacy version of
ICU on a long enough timeline. There is just no getting around that.

I’m probably just going to end up rehashing the old threads I haven’t read yet…

One challenge with this approach is you have things like sort-merge joins that require the same collation across multiple objects. So I think you’d need to keep all the old indexes around until you have new indexes available for all objects in a database, and somehow the planner would need to be smart enough to dynamically figure out old vs new versions on a query-by-query basis. May need an atomic database-wide cutover; running a DB with internally mixed collation versions doesn’t seem like a small challenge. It would require enough disk space for two copies of all indexes, and queries would change which indexes they use in a way that wouldn’t be immediately obvious to users or app dev. Suddenly switching to or from a differently-bloated index could result in confusing and sudden performance changes.

Also there would still need to be a plan to address all the other non-index objects where collation is used, as has been mentioned before.

And given the current architecture, that final “alter database update default collation” command still seems awful risky, bug-prone and difficult to get correct. At least it seems that way to me.

At a minimum, this is a very big project and it seems to me like it may be wise to get more end-to-end fleshing out of the plans before committing incremental pieces in core (which could end up being misguided if the plan doesn’t work as well as assumed). Definitely doesn’t seem to me like anything that will happen in a year or two.

And my opinion is that the problems caused by depending on OS libraries for collation need to be addressed on a shorter timeline than what’s realistic for inventing a new way for a relational database to offer transparent or online upgrades of linguistic collation versions.

Also I still think folks are overcomplicating this by focusing on linguistic collation as the solution. Like 1% of users actually need or care about having the latest technically correct local-language-based sorting, at a database level. MySQL did the right thing here by doing what every other RDBMS did, and just making a simple “good-enough” collation hardcoded in the DB, same across all platforms, that never changes.

The 1% of users who need true linguistic collation can probably deal with the trade-off of dump-and-load upgrades for their ICU indexes and databases for a few more years.

-Jeremy

Sent from my TI-83

#49

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Jeremy Schneider (#48)

Re: Collation version tracking for macOS

On Thu, Jun 9, 2022 at 10:54 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

I’m probably just going to end up rehashing the old threads I haven’t read yet…

One challenge with this approach is you have things like sort-merge joins that require the same collation across multiple objects. So I think you’d need to keep all the old indexes around until you have new indexes available for all objects in a database, and somehow the planner would need to be smart enough to dynamically figure out old vs new versions on a query-by-query basis.

I don't think that it would be fundamentally difficult to have the
planner deal with collations at the level required to avoid incorrect
query plans.

I'm not suggesting that this is an easy project, or that the end
result would be totally free of caveats, such as the issue with merge
joins. I am only suggesting that something like this seems doable.
There aren't that many distinct high level approaches that could
possibly decouple upgrading Postgres/the OS from reindexing. This is
one.

And my opinion is that the problems caused by depending on OS libraries for collation need to be addressed on a shorter timeline than what’s realistic for inventing a new way for a relational database to offer transparent or online upgrades of linguistic collation versions.

But what does that really mean? You can use ICU collations as the
default for the entire cluster now. Where do we still fall short? Do
you mean that there is still a question of actively encouraging using
ICU collations?

I don't understand what you're arguing for. Literally everybody agrees
that the current status quo is not good. That much seems settled to
me.

Also I still think folks are overcomplicating this by focusing on linguistic collation as the solution.

I don't think that's true; I think that everybody understands that
being on the latest linguistic collation is only very rarely a
compelling feature. The whole way that BCP47 tags are so forgiving is
entirely consistent with that view of things.

But what difference does it make? As long as you accept that any
collation *might* need to be updated, or the default ICU version might
change on OS upgrade, then you have to have some strategy for dealing
with the transition. Not being on a very old obsolete version of ICU
will eventually become a "compelling feature" in its own right.

I believe that EDB adopted ICU many years ago, and stuck with one
vendored version for quite a few years. And eventually being on a very
old version of ICU became a real problem.

--
Peter Geoghegan

#50

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Jeremy Schneider (#48)

Re: Collation version tracking for macOS

On Thu, Jun 9, 2022 at 10:54 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

MySQL did the right thing here by doing what every other RDBMS did, and just making a simple “good-enough” collation hardcoded in the DB, same across all platforms, that never changes.

That's not true. Both SQL Server and DB2 have some notion of
collations that are versioned.

Oracle may not, but then Oracle also handles collations by indexing
strxfrm() blobs, with all of the obvious downsides that that entails
(far larger indexes, issues with index-only scans). That seems like an
excellent example of what not to do.

--
Peter Geoghegan

#51

Finnerty, Jim

jfinnert@amazon.com

over 3 years ago

In reply to: Thomas Munro (#33)

Re: Collation version tracking for macOS

Specifying the library name before the language-country code with a new separator (":") as you suggested below has some benefits. Did you consider making the collation version just another collation attribute, such as colStrength, colCaseLevel, etc.?
For example, an alternate syntax might be:

create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63');

Was the concern that ICU might redefine a new collation property with the same name in a different and incompatible way (we might work with the ICU developers to agree on what it should be), or that a version is just not the same kind of collation property as the other collation properties?

(in the example above, I'm assuming that for provider = icu, we could translate '63' into 'libicui18n.so.63' automatically.)

On 6/8/22, 6:22 AM, "Thomas Munro" <thomas.munro@gmail.com> wrote:

<snip>
postgres=# create collation icu63."en-US-x-icu" (provider = icu,
locale = 'libicui18n.so.63:en-US');
CREATE COLLATION
<snip>

#52

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Finnerty, Jim (#51)

Re: Collation version tracking for macOS

On Thu, Jun 9, 2022 at 2:20 PM Finnerty, Jim <jfinnert@amazon.com> wrote:

Specifying the library name before the language-country code with a new separator (":") as you suggested below has some benefits. Did you consider making the collation version just another collation attribute, such as colStrength, colCaseLevel, etc.?
For example, an alternate syntax might be:

create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63');

Why would a user want to specify an ICU version in DDL? Wouldn't that
break in the event of a dump and reload of the database, for example?
It also strikes me as being inconsistent with the general philosophy
for ICU and the broader BCP47 IETF standard, which is "interpret the
locale string to the best of our ability, never throw an error".

Your proposed syntax already "works" today! You just need to create a
schema called icu63 -- then the command executes successfully (for
certain values of successfully).

I'm not arguing against the need for something like this. I'm just
pointing out that there are good reasons to imagine that it would
largely be an implementation detail, perhaps only used to
unambiguously identify which specific ICU version and locale string
relate to which on-disk relfilenode structure currently.

--
Peter Geoghegan

#53

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Finnerty, Jim (#51)

Re: Collation version tracking for macOS

On Fri, Jun 10, 2022 at 9:20 AM Finnerty, Jim <jfinnert@amazon.com> wrote:

Specifying the library name before the language-country code with a new separator (":") as you suggested below has some benefits.

One of the reasons for putting some representation of desired library
into the colliculocale column (rather than, say, adding a new column
to pg_collation) is that I think we'd also want to be able to put that
into daticulocale (for the database default collation, when using
ICU). But really I just did that because it was easy... perhaps, both
pg_collation and pg_database could gain a new column, and that would
be a little more pleasing from a schema design point of view (1NF
atomicity, and it's a sort of foreign key, or at least it would be if
there were another catalog to list library versions...)?

Did you consider making the collation version just another collation attribute, such as colStrength, colCaseLevel, etc.?
For example, an alternate syntax might be:

create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63');

Hmm, I hadn't considered that. (I wouldn't call it "col" version BTW,
it's a library version, and we don't want to overload our terminology
for collation version. We'd still be on the lookout for collversion
changes coming from a single library's minor version changing, for
example an apt-get upgrade can replace the .63 files, which on most
systems are symlinks to .63.1, .63.2 etc. ☠️)

Was the concern that ICU might redefine a new collation property with the same name in a different and incompatible way (we might work with the ICU developers to agree on what it should be), or that a version is just not the same kind of collation property as the other collation properties?

Well my first impression is that we don't really own that namespace,
and since we're using this to decide which library to route calls to,
it seems nicer to put it at a "higher level" than those properties.
So I'd prefer something like "63:en-US", or 63 in a new column.
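
As an illustration only, a "63:en-US"-style string could be split as in the sketch below. The function name split_library_prefix and the rule that the prefix must consist only of digits are assumptions made for this example; this is not what the experimental patch does.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "63:en-US" into a library major version and the ICU locale part.
 *
 * Returns the major version, or -1 if there is no numeric "NN:" prefix.
 * In the -1 case *locale points at the whole input, meaning "use the
 * default (linked-in) library".
 */
int
split_library_prefix(const char *spec, const char **locale)
{
    const char *colon;
    const char *p;

    assert(spec != NULL && locale != NULL);
    *locale = spec;

    colon = strchr(spec, ':');
    if (colon == NULL || colon == spec)
        return -1;
    for (p = spec; p < colon; p++)
    {
        if (!isdigit((unsigned char) *p))
            return -1;
    }
    *locale = colon + 1;        /* everything after the separator */
    return atoi(spec);          /* atoi stops at the ':' */
}
```

Keeping the version outside the locale string means the part after the colon can be handed to ICU untouched, which is the "higher level" placement described above.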

(in the example above, I'm assuming that for provider = icu, we could translate '63' into 'libicui18n.so.63' automatically.)

Yeah. My patch that jams a library name in there was just the fastest
way I could think of to get something off the ground to test whether I
could route calls to different libraries (yes!), though at one moment
I thought it wasn't terrible. But aside from any aesthetic complaints
about that way of doing it, it turns out not to be enough: we need to
dlopen() two different libraries, because we also need some ctype-ish
functions from this guy:

$ nm -D -C /usr/lib/x86_64-linux-gnu/libicuuc.so.63.1 | grep u_strToUpper
00000000000d22c0 T u_strToUpper_63
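
The _63 suffix in that output comes from ICU appending its major version to exported symbol names, so a dlsym() lookup has to ask for the suffixed name. A minimal sketch of building such a name follows; icu_symbol_name is a hypothetical helper, and the actual dlopen()/dlsym() calls are deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "<base>_<major>" into buf, e.g. "u_strToUpper_63".
 * Returns 1 on success, 0 if buf is too small to hold the result.
 */
int
icu_symbol_name(char *buf, size_t buflen, const char *base, int major)
{
    int n;

    assert(buf != NULL && base != NULL && major > 0);
    n = snprintf(buf, buflen, "%s_%d", base, major);
    return n >= 0 && (size_t) n < buflen;
}
```

The resulting string would then be looked up on a handle for libicuuc (for u_strToUpper) or libicui18n (for the collator functions), which is why two dlopen() calls are needed per ICU version.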

I guess we probably want to just put "63" somewhere in pg_collation,
as you say. But then, teaching PostgreSQL how to expand that to a
name that is platform/packaging dependent seems bad. The variations
would probably be minor; on a Mac it's .dylib, on AIX it may be .a,
and the .63 convention may not be universal, I dunno, but some systems
might need absolute paths (depending on ld.so.conf etc), but that's
all stuff that I think an administrator should care about, not us.

Perhaps there could be a new catalog table just for that. So far I
have imagined there would still be one special ICU library linked at
build time, which doesn't need to be dlopen'd, and works automatically
without administrators having to declare it. So a system that has one
linked-in library version 67, and then has two extras that have been
added by an administrator running some new DDL commands might have:

Suppose you pg_upgrade to something that is linked against 71.
Perhaps you'd need to tell it how to dlopen 67 before you can open any
collations with that library, but once you've done that your
collation-dependent partition constraints etc should all hold. I
dunno, lots of problems to figure out here, including quite broad ones
about various migration problems. I haven't understood what Peter G
is suggesting about how upgrades might work, so I'll go and try to do
that...

#54

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#53)

Re: Collation version tracking for macOS

On Thu, Jun 9, 2022 at 4:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Suppose you pg_upgrade to something that is linked against 71.
Perhaps you'd need to tell it how to dlopen 67 before you can open any
collations with that library, but once you've done that your
collation-dependent partition constraints etc should all hold. I
dunno, lots of problems to figure out here, including quite broad ones
about various migration problems. I haven't understood what Peter G
is suggesting about how upgrades might work, so I'll go and try to do
that...

I'm mostly just arguing for the idea that we should treat ICU versions
as essentially interchangeable in terms of their high-level
capabilities around collations and languages/scripts/whatever provided
for by the underlying CLDR version -- tools like pg_dump shouldn't
need to care about ICU versions per se. *ICU itself* should be
versioned, rather than having multiple independent ICU collation
providers. This should work as well as anything like this can ever be
expected to work -- because internationalization is just hard.

These remarks need to be interpreted in the context of how
internationalization is *supposed* to work under standards like BCP47
(again, this is a broad RFC about internationalization, not really an
ICU thing). Natural languages are inherently squishy, messy things.
The "default ICU collations" that initdb puts in pg_collation are not
really special to ICU -- we generate them through a quasi-arbitrary
process that iterates through top-level locales, which results in a
list that is a bit like what you get with libc collations. If you
pg_upgrade, you might have leftover "default ICU collations" that
wouldn't have been the default on a new initdb. It's inherently pretty
chaotic (because humans aren't as predictable as computers), which is
why BCP47 itself is so forgiving -- it literally has to be. Plus there
really isn't much downside to being so lax; as Jeremy pretty much said
already, the important thing is generally to have roughly the right
idea -- which this fuzzy approach mostly manages to do.

Let's not fight that. Let's leave the natural language stuff to the
experts, by versioning a single collation provider (like ICU), and
generalizing the definition of a collation along the same lines --
something that can be implemented using any available version of ICU
(with a preference for the latest on REINDEX, perhaps). It might turn
out that an older version does a slightly better job than a newer
version (regressions cannot be ruled out), but ultimately that's not
our problem. It can't be -- we're not the Unicode Consortium.

It's theoretically up to the user to make sure they're happy with any
behavioral changes under this scheme, perhaps by testing. They won't
actually test very often, of course, but that shouldn't matter in
practice. This is already what we advise for users that use advanced
tailorings of custom ICU collations, such as a custom collation for
"natural sorting", often used for things like alphanumeric invoice
numbers. That might break if you downgrade ICU version, and maybe even
if you upgrade ICU version.

--
Peter Geoghegan

#55

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Geoghegan (#52)

Re: Collation version tracking for macOS

On Fri, Jun 10, 2022 at 10:29 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Jun 9, 2022 at 2:20 PM Finnerty, Jim <jfinnert@amazon.com> wrote:

For example, an alternate syntax might be:

create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63');

Why would a user want to specify an ICU version in DDL? Wouldn't that
break in the event of a dump and reload of the database, for example?
It also strikes me as being inconsistent with the general philosophy
for ICU and the broader BCP47 IETF standard, which is "interpret the
locale string to the best of our ability, never throw an error".

Your proposed syntax already "works" today! You just need to create a
schema called icu63 -- then the command executes successfully (for
certain values of successfully).

Jim was proposing the @colVersion=63 part, but the schema part came
from my example upthread. That was from a real transcript, and I
included that because the way I've been thinking of this so far has
distinct collation OIDs for the "same" collation from different ICU
libraries, and yet I want them to have the same collname. That is, I
don't want (say) "en-US-x-icu63" and "en-US-x-icu71"... I thought it'd
be nice to keep using "en-US-x-icu" as we do today, so if there are
two of them they'd *have* to be in different schemas. That has the
nice property that you can use the search_path to avoid mentioning it.
But I'm not at all wedded to that idea, or any other ideas in this
thread, just trying stuff out...

However, since you mentioned that a simple REINDEX would get you from
one library version to another, I think we're making some completely
different assumptions somewhere along the line, and I don't get your
idea yet. It sounds like you don't want two different collation OIDs
in that case?

The (vastly too) simplistic way I was thinking of it, if you have a
column with an ICU 63 collation, to switch to ICU 67 you first do some
DDL to add ICU 67 to your system and import 67's collations (creating
new collation OIDs), and then e.g. ALTER TABLE foo ALTER COLUMN bar TYPE
text COLLATE icu67."en-US-x-icu", which will rebuild your indexes.
That's a big job, and doesn't address how you switch the database
default collation. None of that is very satisfying, much more thought
needed, but it falls out of the decision to have distinct
icu63."en-US-x-icu" and icu67."en-US-x-icu". You seem to have some
other idea in mind where the system only knows about one
"en-US-x-icu", but somehow, somewhere else (where?), keeps track of
which indexes were built with ICU 63 and which with ICU 67, which I
don't yet grok. Or did I misunderstand?

#56

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#55)

Re: Collation version tracking for macOS

On Thu, Jun 9, 2022 at 5:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:

However, since you mentioned that a simple REINDEX would get you from
one library version to another, I think we're making some completely
different assumptions somewhere along the line, and I don't get your
idea yet. It sounds like you don't want two different collation OIDs
in that case?

Not completely sure about the REINDEX behavior, but it's at least an
example of the kind of thing that could be enabled. I'm proposing that
pg_collation-wise collations have the most abstract possible
definitions -- "logical collations", which are decoupled from
"physical collations" that actually describe a particular ICU collator
associated with a particular ICU version (all the information that
keeps how the on-disk structure is organized for a given relfilenode
straight). In other words, the definition of a collation is the user's
own definition. To the user, it's pretty close to (maybe even exactly)
a BCP47 string, now and forever.

You can make arguments against the REINDEX behavior. And maybe those
arguments will turn out to be good arguments. Assuming that they are,
then the solution may just be to have a special option that will make
the REINDEX use the most recent library.

The important point is to make the abstraction as high level as
possible from the point of view of users.
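
One way to picture the split, purely as a sketch: the struct and field names below are hypothetical and do not correspond to any existing catalog. A logical collation is what the user names; each index additionally remembers the ICU major version it was built with, and comparisons are routed to that version until a REINDEX moves the index forward.

```c
#include <assert.h>
#include <stddef.h>

/* What the user sees: roughly a BCP47 tag (hypothetical layout). */
typedef struct LogicalCollation
{
    const char *name;           /* e.g. "en-US-x-icu" */
    const char *bcp47;          /* e.g. "en-US" */
} LogicalCollation;

/* What an index was physically built with (hypothetical layout). */
typedef struct IndexCollationState
{
    const LogicalCollation *coll;
    int         built_with_icu; /* ICU major version used at build time */
} IndexCollationState;

/* Comparisons on an existing index keep using the library it was built with. */
int
icu_version_for_index(const IndexCollationState *idx)
{
    assert(idx != NULL);
    return idx->built_with_icu;
}

/* REINDEX rebuilds with the preferred (for example, latest) ICU version. */
void
reindex(IndexCollationState *idx, int preferred_icu)
{
    assert(idx != NULL && preferred_icu > 0);
    idx->built_with_icu = preferred_icu;
}

/* An index is "behind" while it still depends on an older library. */
int
index_needs_reindex(const IndexCollationState *idx, int preferred_icu)
{
    return icu_version_for_index(idx) != preferred_icu;
}
```

Note that reindex() never touches the coll pointer: the logical collation stays the same across library versions, and only the per-index physical state changes.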

You seem to have some
other idea in mind where the system only knows about one
"en-US-x-icu", but somehow, somewhere else (where?), keeps track of
which indexes were built with ICU 63 and which with ICU 67, which I
don't yet grok. Or did I misunderstand?

That's what I meant, yes -- you got it right.

Another way to put it would be to go as far as we can in the direction
of decoupling the concerns that we have as database people from the
concerns of natural language experts. Let's not step on their toes,
and let's avoid having our toes trampled on.

--
Peter Geoghegan

#57

Tobias Bussmann

t.bussmann@gmx.net

over 3 years ago

In reply to: Thomas Munro (#6)

Re: Collation version tracking for macOS

Thanks for picking this up!

How can I see evidence of this? I'm comparing Debian, FreeBSD and
macOS 12.4 and when I run "LC_COLLATE=en_US.UTF-8 sort
/usr/share/dict/words" I get upper and lower case mixed together on
the other OSes, but on the Mac the upper case comes first, which is my
usual smoke test for "am I looking at binary sort order?"

Perhaps I can shed some light on this matter:

Apple's libc collations have always been a bit special in that regard, even the non-UTF8 ones. Rooted in ancient FreeBSD, they "try to keep collating table backward compatible with ASCII", so upper- and lowercase characters are kept separate (there are exceptions, like 'cs_CZ.ISO8859-2'). The latest public sources I can find are in adv_cmds-119 [1], which belongs to OS X 10.5 [2]. These correspond to the ones used in FreeBSD up to v10 [3], although the timestamps point to an origin around FreeBSD 5. Further, very few locales are actually present on macOS (36, none of them supporting Unicode), and these have not changed for a very long time (I verified that from OS X 10.6.8 through macOS 12.4 [4]; the one exception is a 'de_DE-A.ISO8859-1' present only in macOS 10.15).

What they do instead is symlink [5] missing collations to similar ones, even across encodings, which often results in la_LN.US-ASCII being used ('la_LN' seems to stand for a Latin meta-language), and that is exactly byte order [6]. These symlinks did not change [7] from OS X 10.6.8 through macOS 10.15.7. But in macOS 11 many of them changed their target: for example, the popular 'en_US.UTF-8' went from 'la_LN.US-ASCII' to 'la_LN.ISO8859-1', and 'de_DE.UTF-8' from 'la_LN.US-ASCII' to 'de_DE.ISO8859-1'. In effect, about half of the UTF-8 collations changed from no collation to partial/broken collation support. macOS 12 again shows no changes; tests for macOS 13 are outstanding.

# tl;dr

With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't see a difference between "C" and "en_US.UTF-8" but with "( echo '5£'; echo '£5' ) | LC_COLLATE=en_US.UTF-8 sort" you can produce a difference against "( echo '5£'; echo '£5' ) | LC_COLLATE=C sort". Or test with "diff -q <(LC_COLLATE=C sort /usr/share/dict/words) <(LC_COLLATE=es_ES.UTF-8 sort /usr/share/dict/words)"
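
The same comparison can be made from C with strcoll(). This is only a sketch: whether "en_US.UTF-8" exists, and what order it yields, depends on the platform, so no output is shown. The helper name collate_sign is made up for the example.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Compare a and b under the named LC_COLLATE locale.
 * Returns -1, 0 or 1, or 2 if the locale cannot be loaded.
 * Note that setlocale() changes process-wide state.
 */
int
collate_sign(const char *locale_name, const char *a, const char *b)
{
    int r;

    assert(locale_name != NULL && a != NULL && b != NULL);
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return 2;
    r = strcoll(a, b);
    return (r > 0) - (r < 0);
}
```

Calling collate_sign("C", "5\xC2\xA3", "\xC2\xA3" "5") and then the same with "en_US.UTF-8" mirrors the two sort pipelines above (the byte sequence C2 A3 is the UTF-8 encoding of the pound sign); a differing result means the locale is not plain byte order.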

The upside is that we don't have to cope with the new characters added in every version of Unicode (although I have not examined LC_CTYPE yet).

best regards
Tobias

[1]: https://github.com/apple-oss-distributions/adv_cmds/tree/adv_cmds-119/usr-share-locale.tproj/colldef
[2]: https://opensource.apple.com/releases/
[3]: https://github.com/freebsd/freebsd-src/tree/stable/10/share/colldef
[4]: find /usr/share/locale/*/LC_COLLATE -type f -exec md5 {} \;
[5]: https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/BSDmakefile
[6]: https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/la_LN.US-ASCII.src
[7]: find /usr/share/locale/*/LC_COLLATE -type l -exec stat -f "%N%SY" {} \;

#58

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Geoghegan (#56)

Re: Collation version tracking for macOS

On Fri, Jun 10, 2022 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Jun 9, 2022 at 5:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:

You seem to have some
other idea in mind where the system only knows about one
"en-US-x-icu", but somehow, somewhere else (where?), keeps track of
which indexes were built with ICU 63 and which with ICU 67, which I
don't yet grok. Or did I misunderstand?

That's what I meant, yes -- you got it right.

OK, I see now.

I think if you design a system to record the library that each index
(and constraint, ...) was built with, it'd surely finish up being at
least conceptually something like the system Julien and I built and
then reverted in ec483147. Except that it'd be a stronger form of
that, because instead of just squawking when the version is not the
latest/current version, it'd keep working but route collations to the
older library for indexes that haven't been rebuilt yet.

That sounds nice, but introduces subtle problems for the planner. For
example, pathkeys that look compatible might not be, when
merge-joining an ICU 63 index scan against an ICU 67 index scan. You
could teach it about that, whereas with my distinct OID concept they
would already be considered non-matching automatically.
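To make the pathkey concern concrete, here is a small sketch in C. It is not planner code: SketchSortKey and sort_keys_compatible are hypothetical names, and the real pathkey machinery is far more involved. It only shows that once one collation OID can be served by several library versions, an ordering comparison needs the version as well as the OID.

```c
#include <stdbool.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid type */

/* Hypothetical description of the ordering produced by one index scan. */
typedef struct SketchSortKey
{
    Oid         collation;          /* logical collation, e.g. en-US-x-icu */
    int         icu_major_version;  /* library the index was built with */
} SketchSortKey;

/*
 * Two orderings can only be merged if they agree on the logical collation
 * AND on the library version that defines its behavior.
 */
static bool
sort_keys_compatible(const SketchSortKey *a, const SketchSortKey *b)
{
    return a->collation == b->collation &&
        a->icu_major_version == b->icu_major_version;
}
```

With the distinct OID concept the second comparison would be redundant, because an ICU 63 collation and an ICU 67 collation would already have different OIDs.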

#59

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#58)

Re: Collation version tracking for macOS

On Thu, Jun 9, 2022 at 5:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:

That sounds nice, but introduces subtle problems for the planner. For
example, pathkeys that look compatible might not be, when
merge-joining an ICU 63 index scan against an ICU 67 index scan. You
could teach it about that, whereas with my distinct OID concept they
would already be considered non-matching automatically.

Right -- my proposal is likely to be more difficult to implement.
Seems like it might be worth going to the trouble of teaching the
planner about this difference, though.

That exact issue seems like the true underlying problem to me: we have
two sets of behaviors for a given collation, that are equivalent for
some purposes (the user thinks of them as totally interchangeable),
but not for other purposes (we can't expect old indexes to continue to
work with a new physical collation for their logical collation). So
directly tackling that seems natural to me.

--
Peter Geoghegan

#60

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Geoghegan (#59)

Re: Collation version tracking for macOS

On Fri, Jun 10, 2022 at 1:06 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Jun 9, 2022 at 5:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:

That sounds nice, but introduces subtle problems for the planner. For
example, pathkeys that look compatible might not be, when
merge-joining an ICU 63 index scan against an ICU 67 index scan. You
could teach it about that, whereas with my distinct OID concept they
would already be considered non-matching automatically.

Right -- my proposal is likely to be more difficult to implement.
Seems like it might be worth going to the trouble of teaching the
planner about this difference, though.

Well I can report that the system from ec483147 was hellishly
complicated, and not universally loved. Which isn't to say that there
isn't a simple and loveable way to do it, waiting to be discovered,
and I do think we could fix most of the problems with that work. It's
just that I was rather thinking of this new line of attack as being a
way to avoid the complications of identifying dependencies on moving
things through complicated analysis of object graphs and AST, by
instead attaching those slippery external things to the floor with a
nail gun. That is, treating ICU 63 and ICU 67's collations as
completely unrelated. I understand that that's not ideal from an
end-user perspective, but maybe it's more realistically and robustly
and simply implementable. Hmm.

#61

Tobias Bussmann

t.bussmann@gmx.net

over 3 years ago

In reply to: Tom Lane (#36)

Re: Collation version tracking for macOS

Am 08.06.2022 um 16:16 schrieb Tom Lane <tgl@sss.pgh.pa.us>:

The proposed patch would result in a warning about every collation-
sensitive index during every macOS major version upgrade, ie about
once a year for most people.
We need something that has at least *some* connection to actual changes.

In Postgres.app we introduced default collation versioning and warnings about possible mismatches from outside the actual server. When the user runs initdb with the GUI wrapper, the OS version and a checksum of the LC_COLLATE file of the default collation in use are stored as metadata. This allows us to display a reindex warning on startup if the hash changes, or if we hardcode a known incompatible OS change.

Having collversion support on macOS within postgres would leverage the existing infrastructure for version change warnings and enable support for multiple collations. But I agree, we need something more specific than the major OS version here. Lacking any collation version information from the provider, a checksum of the binary LC_COLLATE file is the best I can come up with.
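As a rough illustration of the checksum idea, here is a C sketch that derives a version string from the bytes of a collation file. The function names are hypothetical, and the choice of 64-bit FNV-1a is an assumption made only to keep the example dependency-free. This is not necessarily what Postgres.app uses, and a real implementation might prefer a cryptographic hash.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define FNV1A64_OFFSET_BASIS UINT64_C(0xcbf29ce484222325)
#define FNV1A64_PRIME        UINT64_C(0x100000001b3)

/* Fold len bytes into a running 64-bit FNV-1a hash. */
static uint64_t
fnv1a64_update(uint64_t hash, const unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        hash ^= data[i];
        hash *= FNV1A64_PRIME;
    }
    return hash;
}

/*
 * Hypothetical helper: hash the file at path (for example
 * /usr/share/locale/<name>/LC_COLLATE) and write 16 hex digits plus a
 * terminating NUL into version.  Returns 0 on success, -1 on failure.
 */
static int
collation_file_version(const char *path, char *version, size_t version_size)
{
    unsigned char buf[4096];
    uint64_t    hash = FNV1A64_OFFSET_BASIS;
    size_t      n;
    FILE       *f;

    if (version_size < 17)
        return -1;
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        hash = fnv1a64_update(hash, buf, n);
    if (ferror(f))
    {
        fclose(f);
        return -1;
    }
    fclose(f);
    snprintf(version, version_size, "%016" PRIx64, hash);
    return 0;
}
```

Because fopen() follows symlinks, a symlink that is retargeted to a different file (as happened in macOS 11) would produce a different string, while an OS upgrade that leaves the file untouched would not.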

Best regards,
Tobias

#62

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#60)

Re: Collation version tracking for macOS

On Thu, Jun 9, 2022 at 6:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Well I can report that the system from ec483147 was hellishly
complicated, and not universally loved. Which isn't to say that there
isn't a simple and loveable way to do it, waiting to be discovered,
and I do think we could fix most of the problems with that work.

I admit that I don't have much idea of how difficult it would be to
make it all work. I'm definitely not claiming that it's easy.

I understand that that's not ideal from an
end-user perspective, but maybe it's more realistically and robustly
and simply implementable. Hmm.

That may be a decisive reason to go with your proposal. I really don't know.

--
Peter Geoghegan

#63

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Geoghegan (#62)

Re: Collation version tracking for macOS

On Fri, Jun 10, 2022 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Jun 9, 2022 at 6:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Well I can report that the system from ec483147 was hellishly
complicated, and not universally loved. Which isn't to say that there
isn't a simple and loveable way to do it, waiting to be discovered,
and I do think we could fix most of the problems with that work.

I admit that I don't have much idea of how difficult it would be to
make it all work. I'm definitely not claiming that it's easy.

Hrrm... perhaps my memory of ec483147 is confusing me. I think I'm
starting to come around to your idea a bit more now. Let me sketch
out some more details here and see where this goes.

I *was* thinking that you'd have to find all references to collations
through static analysis, as we did in that version tracking project.
But perhaps for this you only need to record one ICU library version
for the whole index at build time, without any analysis at all, and it
would be used for any and all ICU collations that are reached while
evaluating anything to do with that index (index navigation, but also
eg WHERE clause for partial index, etc). That would change to the
"current" value when you REINDEX.

Perhaps that could be modeled with a pg_depend row pointing to a
pg_icu_library row, which you'd probably need anyway, to prevent a
registered ICU library that is needed for a live index from being
dropped. (That's assuming that the pg_icu_library catalogue concept
has legs... well if we're going with dlopen(), we'll need *somewhere*
to store the shared object paths. Perhaps it's not a given that we
really want paths in a table... I guess it might prevent certain
cross-OS streaming rep scenarios, but mostly that'd be solvable with
symlinks...)

One problem is that to drop an old pg_icu_library row, you'd have to
go and REINDEX everything, even indexes that don't really use
collations! If you want to prove that an index doesn't use
collations, you're back in ec483147 territory. Perhaps we don't care
about that and we're happy to let useless dependencies on
pg_icu_library rows accumulate, or to require useless work to be able
to drop them.

I'm not sure how we'd know what the "current" library version is. The
highest numbered one currently in that pg_icu_library catalogue I
sketched? So if I do whatever new DDL we invent to tell the system
about a new ICU library, and it's got a higher number than any others,
new indexes start using it but old ones keep using whatever they're
using. Maybe with some way for users to override it, so users who
really want to use an older one when creating a new index can say so.
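A sketch of that selection rule in C, under the assumption that the registered libraries are available as a plain array of major version numbers. The function name and the "0 means no override" convention are made up for this example. No such catalog or API exists yet.

```c
#include <stddef.h>

/*
 * Hypothetical rule for picking the ICU major version to stamp on a new
 * index.  registered[] lists the major versions the system knows about.
 * requested is 0 for "no preference", otherwise a version the user asked
 * for explicitly.  Returns the chosen version, or -1 if none qualifies.
 */
static int
choose_icu_major_version(const int *registered, size_t nregistered,
                         int requested)
{
    int         highest = -1;

    for (size_t i = 0; i < nregistered; i++)
    {
        /* An explicit override wins if that library is registered. */
        if (requested > 0 && registered[i] == requested)
            return requested;
        if (registered[i] > highest)
            highest = registered[i];
    }

    /* An override naming an unknown library is an error, not a fallback. */
    return requested > 0 ? -1 : highest;
}
```

Existing indexes would keep whatever version they were stamped with. Only index creation and REINDEX would consult this rule.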

I suppose it would be the same for constraints. For those,
considering that they need to be rechecked, the only way to change ICU
version would be to drop the constraint and recreate it. Same goes
for range partitioned tables, right? It'd keep using the old ICU
library until you drop the partitioned table and create a new one, at which
point you're using the new current ICU library and it'll recheck all
your partitions against the constraints when you add them. (Those
constraints are much simpler constants, so for those we could prove no
use of ICU without the general ec483147 beast.)

I think these things would have to survive pg_upgrade, but would be
lost on dump/restore.

There's still the pathkey problem to solve, and maybe some more
problems like that hiding somewhere.

I'm not sold on any particular plan, but working through some examples
helped me see your idea better... I may try to code that up in a
minimal way so we can kick the tyres...

#64

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Tobias Bussmann (#57)

Re: Collation version tracking for macOS

On Fri, Jun 10, 2022 at 12:48 PM Tobias Bussmann <t.bussmann@gmx.net> wrote:

Perhaps I can shed some light on this matter:

Hi Tobias,

Oh, thanks for your answers. Definitely a few bits of interesting
archeology I was not aware of.

Apple's libc collations have always been a bit special in that concern, even for the non-UTF8 ones. Rooted in ancient FreeBSD they "try to keep collating table backward compatible with ASCII" thus upper and lower cases characters are separated (There are exceptions like 'cs_CZ.ISO8859-2').

Wow. I see that I can sort the English dictionary the way most people
expect by pretending it's Czech. What a mess!

With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't see a difference between "C" and "en_US.UTF-8" but with "( echo '5£'; echo '£5' ) | LC_COLLATE=en_US.UTF-8 sort" you can produce a difference against "( echo '5£'; echo '£5' ) | LC_COLLATE=C sort". Or test with "diff -q <(LC_COLLATE=C sort /usr/share/dict/words) <(LC_COLLATE=es_ES.UTF-8 sort /usr/share/dict/words)"

I see, so it does *something*, just not what anybody wants.

#65

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Thomas Munro (#63)

1 attachment(s)

Re: Collation version tracking for macOS

On Fri, Jun 10, 2022 at 4:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I'm not sold on any particular plan, but working through some examples
helped me see your idea better... I may try to code that up in a
minimal way so we can kick the tyres...

I did a bit of hacking on that idea. The goal was to stamp each index
with an ICU major version (not sure where, not done in the attached),
and if that doesn't match the library we're linked against, we'd try
to dlopen() libraries via symlinks with known name formats under
PGDATA/pg_icu_lib, which an administrator would have to create. That
seemed a bit simpler than dealing with new catalogs for now...
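For reference, the "known name formats" boil down to two string patterns, shown here as a standalone C sketch. The format strings match the ones in the attached patch. The helper names and the truncation checks are additions for this example only.

```c
#include <stdio.h>

/*
 * Build the path the backend would dlopen() for a given ICU major version,
 * e.g. "<data_dir>/pg_icu_lib/libicui18n.so.63".
 * Returns 0 on success, -1 if the result did not fit in dst.
 */
static int
icu_library_path(char *dst, size_t dstlen, const char *data_dir,
                 int major_version)
{
    int         n = snprintf(dst, dstlen, "%s/pg_icu_lib/libicui18n.so.%d",
                             data_dir, major_version);

    return (n < 0 || (size_t) n >= dstlen) ? -1 : 0;
}

/*
 * Build the versioned symbol name ICU exports by default,
 * e.g. "ucol_open_63".  Returns 0 on success, -1 on truncation.
 */
static int
icu_symbol_name(char *dst, size_t dstlen, const char *function,
                int major_version)
{
    int         n = snprintf(dst, dstlen, "%s_%d", function, major_version);

    return (n < 0 || (size_t) n >= dstlen) ? -1 : 0;
}
```

The symbol name is what gets passed to dlsym(). An ICU build with symbol suffixes disabled would not export such names, so the lookup would simply fail for it.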

See attached unfinished patch, which implements some of that. It has
a single collation for en-US-x-icu, and routes calls to different
libraries depending on dynamic scope (which in cold hard reality
translates into a nasty global variable "current_icu_library"). The
idea was that it would normally point to the library we're linked
against, but whenever computing anything related to an index stamped
with ICU 63, we'd do pg_icu_activate_major_version(63), and afterwards
undo that. Performance concerns aside, that now seems a bit too ugly
and fragile to me, and I gave up. How could we convince ourselves
that we'd set the active ICU library correctly in all the required
dynamic scopes, but not leaked it into any other scopes? Does that
even make sense? But if not done like that, how else could we do it?
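To show what "activate, then undo" means in the simplest possible terms, here is a standalone C sketch. A plain int stands in for current_icu_library, and all names are hypothetical. It deliberately ignores error handling: in the backend, ereport(ERROR) performs a non-local exit, so a real version would need PG_TRY/PG_FINALLY or a reset during error cleanup to avoid leaking the setting.

```c
/* Stand-in for the patch's global; 67 plays the role of the linked library. */
static int  current_icu_major_version = 67;

typedef void (*icu_scope_callback) (void *arg);

/*
 * Run callback(arg) with the given ICU major version active, then restore
 * whatever was active before.  Nesting works because each level restores
 * the value it saved.
 */
static void
with_icu_major_version(int major_version, icu_scope_callback callback,
                       void *arg)
{
    int         saved = current_icu_major_version;

    current_icu_major_version = major_version;
    callback(arg);
    current_icu_major_version = saved;
}

/* Example callback: record which version was active while it ran. */
static void
record_active_version(void *arg)
{
    *(int *) arg = current_icu_major_version;
}
```

Even in this toy form the weakness is visible: anything the callback reaches, however indirectly, sees the switched version, and nothing in the types says so.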

Better ideas/code welcome.

Executive summary of experiments so far: the "distinct collations"
concept is quite simple and robust, but exposes all the versions to
users and probably makes it really hard to upgrade (details not worked
out), while the "time travelling collations" concept is nice for users
but hard to pin down and prove correctness for since it seems to
require dynamic scoping/global state changes affecting code in far
away places.

Attachments:

v2-0001-WIP-allow-multiple-ICU-libraries.patchtext/x-patch; charset=US-ASCII; name=v2-0001-WIP-allow-multiple-ICU-libraries.patchDownload

From e842402fd6eeca413915c3808191d7928dfc0889 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v2] WIP: allow multiple ICU libraries

XXX This is highly experimental code
---
 src/backend/access/hash/hashfunc.c |  18 +-
 src/backend/utils/adt/pg_locale.c  | 280 +++++++++++++++++++++++++++--
 src/backend/utils/adt/varchar.c    |  16 +-
 src/backend/utils/adt/varlena.c    |  57 +++---
 src/backend/utils/init/miscinit.c  |   9 +
 src/bin/initdb/initdb.c            |   1 +
 src/include/utils/pg_locale.h      |  94 +++++++++-
 7 files changed, 420 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..81571658ed 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,13 @@ hashtext(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = current_icu_library->getSortKey(pg_icu_collator(current_icu_library,
+																	mylocale),
+													uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			current_icu_library->getSortKey(pg_icu_collator(current_icu_library,
+															mylocale),
+											uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -352,14 +354,14 @@ hashtextextended(PG_FUNCTION_ARGS)
 			UChar	   *uchar = NULL;
 			Size		bsize;
 			uint8_t    *buf;
+			UCollator  *ucol;
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			ucol = pg_icu_collator(current_icu_library, mylocale);
+			bsize = current_icu_library->getSortKey(ucol, uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			current_icu_library->getSortKey(ucol, uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index a0490a7522..d9d17f1b8c 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -58,6 +58,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_control.h"
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/formatting.h"
 #include "utils/hsearch.h"
@@ -78,6 +79,10 @@
 #include <shlwapi.h>
 #endif
 
+#ifdef HAVE_DLOPEN
+#include <dlfcn.h>
+#endif
+
 #define		MAX_L10N_DATA		80
 
 
@@ -1435,29 +1440,265 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
+#ifdef USE_ICU
+
 struct pg_locale_struct default_locale;
 
+/* Table of ICU libraries we have loaded. */
+static pg_icu_library *icu_libraries[PG_NUM_ICU_MAJOR_VERSIONS];
+
+/*
+ * Currently active ICU library.  Normally this corresponds to the ICU library
+ * that we were compiled and linked against, but while interacting with
+ * indexes built with an old ICU library, it'll be changed.
+ */
+pg_icu_library *current_icu_library;
+
+/*
+ * Free an ICU library.  pg_icu_library objects that are successfully
+ * constructed stick around for the lifetime of the backend, but this is used
+ * to clean up if initialization fails.
+ */
+static void
+free_icu_library(pg_icu_library *l)
+{
+	if (l->handle)
+		dlclose(l->handle);
+	pfree(l);
+}
+
+static void *
+get_icu_function(void *handle, const char *function, int version)
+{
+	char name[80];
+
+	snprintf(name, sizeof(name), "%s_%d", function, version);
+
+	return dlsym(handle, name);
+}
+
+/*
+ * Probe a dynamically loaded library to see which major version of ICU it
+ * contains.
+ */
+static int
+get_icu_library_major_version(void *handle)
+{
+	for (int i = PG_MIN_ICU_MAJOR_VERSION; i <= PG_MAX_ICU_MAJOR_VERSION; ++i)
+		if (get_icu_function(handle, "ucol_open", i))
+			return i;
+
+	/*
+	 * It's a later version we don't dare use, an old version we don't
+	 * support, an ICU build with symbol suffixes disabled, or not ICU.
+	 */
+	return -1;
+}
+
+/*
+ * Given an ICU major version number, return the object we need to access it,
+ * or fail while trying to load it.
+ */
+static pg_icu_library *
+get_icu_library(int major_version)
+{
+	pg_icu_library *l;
+
+	Assert(major_version >= PG_MIN_ICU_MAJOR_VERSION &&
+		   major_version <= PG_MAX_ICU_MAJOR_VERSION);
+
+	/* See if it's already loaded. */
+	l = icu_libraries[PG_ICU_SLOT(major_version)];
+	if (l)
+		return l;
+
+	/* Make a new entry. */
+	l = MemoryContextAllocZero(TopMemoryContext, sizeof(*l));
+	if (major_version == U_ICU_VERSION_MAJOR_NUM)
+	{
+		/*
+		 * This is the version we were compiled and linked against.  Simply
+		 * assign the function pointers.
+		 *
+		 * These assignments will fail to compile if an incompatible API
+		 * change is made to some future version of ICU, at which point we
+		 * might need to consider special treatment for different major
+		 * version ranges, with intermediate trampoline functions.
+		 */
+		l->major_version = major_version;
+		l->open = ucol_open;
+		l->close = ucol_close;
+		l->getVersion = ucol_getVersion;
+		l->versionToString = u_versionToString;
+		l->strcoll = ucol_strcoll;
+		l->strcollUTF8 = ucol_strcollUTF8;
+		l->getSortKey = ucol_getSortKey;
+		l->nextSortKeyPart = ucol_nextSortKeyPart;
+		l->errorName = u_errorName;
+
+		/*
+		 * Also assert the size of a couple of types used as output buffers,
+		 * as a canary to tell us to add extra padding in the (unlikely) event
+		 * that a later release makes these values smaller.
+		 */
+		StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+						 "u_versionToString output buffer size changed incompatibly");
+		StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+						 "ucol_getVersion output buffer size changed incompatibly");
+	}
+	else
+	{
+		/* This is an older version, so we'll need to use dlopen(). */
+#ifdef HAVE_DLOPEN
+		char		libicui18n_name[MAXPGPATH];
+		char		libicuuc_name[MAXPGPATH];
+		int			found_major_version;
+
+		/*
+		 * We don't like to open versions newer than what we're linked
+		 * against, to reduce the risk of an API change biting us.
+		 */
+		if (major_version > U_ICU_VERSION_MAJOR_NUM)
+			elog(ERROR, "ICU major version %d higher than linked version %d, refusing to open",
+				 major_version, U_ICU_VERSION_MAJOR_NUM);
+
+		/*
+		 * On many distributions, multiple ICU libraries can be installed
+		 * concurrently, but we don't want to guess how to find them.  The
+		 * administrator will need to put libraries or symlinks under
+		 * pg_icu_lib.
+		 */
+		snprintf(libicui18n_name,
+				 sizeof(libicui18n_name),
+				 "%s/pg_icu_lib/libicui18n.so.%d",
+				 DataDir,
+				 major_version);
+		snprintf(libicuuc_name,
+				 sizeof(libicuuc_name),
+				 "%s/pg_icu_lib/libicuuc.so.%d",
+				 DataDir,
+				 major_version);
+
+		l->handle = dlopen(libicui18n_name, RTLD_NOW | RTLD_GLOBAL);
+		if (l->handle == NULL)
+		{
+			/* dlopen() reports failures through dlerror(), not errno. */
+			const char *load_error = dlerror();
+			free_icu_library(l);
+			ereport(ERROR,
+					(errmsg("could not load library \"%s\": %s", libicui18n_name,
+							load_error ? load_error : "unknown error")));
+		}
+
+		/* Sanity check the version. */
+		found_major_version = get_icu_library_major_version(l->handle);
+		if (found_major_version < 0)
+		{
+			free_icu_library(l);
+			ereport(ERROR,
+					(errmsg("could not find compatible ICU major version in library \"%s\"",
+							libicui18n_name)));
+		}
+		if (found_major_version != major_version)
+		{
+			free_icu_library(l);
+			ereport(ERROR,
+					(errmsg("expected to find ICU major version %d in library \"%s\", but found %d",
+							major_version, libicui18n_name, found_major_version)));
+		}
+		l->major_version = major_version;
+
+		/* Look up all the functions we need. */
+		l->open = get_icu_function(l->handle, "ucol_open", major_version);
+		l->close = get_icu_function(l->handle, "ucol_close", major_version);
+		l->getVersion = get_icu_function(l->handle, "ucol_getVersion",
+										 major_version);
+		l->versionToString = get_icu_function(l->handle, "u_versionToString",
+											  major_version);
+		l->strcoll = get_icu_function(l->handle, "ucol_strcoll",
+									  major_version);
+		l->strcollUTF8 = get_icu_function(l->handle, "ucol_strcollUTF8",
+										  major_version);
+		l->getSortKey = get_icu_function(l->handle, "ucol_getSortKey",
+										 major_version);
+		l->nextSortKeyPart = get_icu_function(l->handle, "ucol_nextSortKeyPart",
+											  major_version);
+		l->errorName = get_icu_function(l->handle, "u_errorName",
+										major_version);
+		if (!l->open ||
+			!l->close ||
+			!l->getVersion ||
+			!l->versionToString ||
+			!l->strcoll ||
+			!l->strcollUTF8 ||
+			!l->getSortKey ||
+			!l->nextSortKeyPart ||
+			!l->errorName)
+		{
+			free_icu_library(l);
+			ereport(ERROR,
+					(errmsg("could not find expected symbols in library \"%s\"",
+							libicui18n_name)));
+		}
+#else
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("dynamically loaded ICU libraries are not supported in this build")));
+#endif
+	}
+
+	icu_libraries[major_version - PG_MIN_ICU_MAJOR_VERSION] = l;
+
+	return l;
+}
+
+/*
+ * Set the current active ICU major version.  Use -1 for the ICU library we're
+ * linked against.  (That avoids callers needing to include ICU headers to
+ * find that).
+ *
+ * This is set to the ICU library we linked against at transaction start, but
+ * might temporarily be changed to an older dlopen'd one while interacting
+ * with objects created by a PostgreSQL binary linked against an older ICU
+ * library.
+ */
+void pg_icu_activate_major_version(int major_version)
+{
+	current_icu_library =
+		get_icu_library(major_version == -1 ?
+						U_ICU_VERSION_MAJOR_NUM : major_version);
+}
+
+#endif
+
 void
 make_icu_collator(const char *iculocstr,
 				  struct pg_locale_struct *resultp)
 {
 #ifdef USE_ICU
+	pg_icu_library *l;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	/*
+	 * Initially we will open the collator with the linked ICU library only.
+	 * Collators for any other versions we need later will be opened on demand
+	 * with the appropriate library.
+	 */
+	l = get_icu_library(U_ICU_VERSION_MAJOR_NUM);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
+	collator = l->open(iculocstr, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+						iculocstr, l->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	if (l->major_version < 54)
 		icu_set_collation_attributes(collator, iculocstr);
 
 	/* We will leak this string if the caller errors later :-( */
 	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
-	resultp->info.icu.ucol = collator;
+	resultp->info.icu.ucol[PG_ICU_SLOT(U_ICU_VERSION_MAJOR_NUM)] = collator;
 #else							/* not USE_ICU */
 	/* could get here if a collation was created by a build with ICU */
 	ereport(ERROR,
@@ -1688,21 +1929,29 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 #ifdef USE_ICU
 	if (collprovider == COLLPROVIDER_ICU)
 	{
+		pg_icu_library *l;
 		UCollator  *collator;
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
+		/*
+		 * XXX Here we're only reporting the version from the linked ICU
+		 * library!  The catalog arrangement for collversion doesn't make any
+		 * sense, in a world with multiple ICU libraries accessible through
+		 * one collation OID.
+		 */
+		l = get_icu_library(U_ICU_VERSION_MAJOR_NUM);
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = l->open(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, l->errorName(status))));
+		l->getVersion(collator, versioninfo);
+		l->close(collator);
 
-		u_versionToString(versioninfo, buf);
+		l->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -1770,6 +2019,8 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 
 
 #ifdef USE_ICU
+
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -1991,19 +2242,22 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
+	pg_icu_library *l;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	/* We'll use the linked ICU library to check for validity. */
+	l = get_icu_library(U_ICU_VERSION_MAJOR_NUM);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = l->open(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, l->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	if (l->major_version < 54)
 		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+	l->close(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index bbeb0a2653..dfd2b1c575 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1022,14 +1022,14 @@ hashbpchar(PG_FUNCTION_ARGS)
 			UChar	   *uchar = NULL;
 			Size		bsize;
 			uint8_t    *buf;
+			UCollator  *ucol;
 
 			ulen = icu_to_uchar(&uchar, keydata, keylen);
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			ucol = pg_icu_collator(current_icu_library, mylocale);
+			bsize = current_icu_library->getSortKey(ucol, uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			current_icu_library->getSortKey(ucol, uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -1083,14 +1083,14 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 			UChar	   *uchar = NULL;
 			Size		bsize;
 			uint8_t    *buf;
+			UCollator  *ucol;
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			ucol = pg_icu_collator(current_icu_library, mylocale);
+			bsize = current_icu_library->getSortKey(ucol, uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			current_icu_library->getSortKey(ucol, uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 919138eaf3..abcc623f5a 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1666,10 +1666,11 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					UErrorCode	status;
 
 					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
+					result = current_icu_library->strcollUTF8(pg_icu_collator(current_icu_library,
+																			  mylocale),
+															  arg1, len1,
+															  arg2, len2,
+															  &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
 								(errmsg("collation failed: %s", u_errorName(status))));
@@ -1685,9 +1686,10 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
 					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
 
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
+					result = current_icu_library->strcoll(pg_icu_collator(current_icu_library,
+																		  mylocale),
+														  uchar1, ulen1,
+														  uchar2, ulen2);
 
 					pfree(uchar1);
 					pfree(uchar2);
@@ -2389,13 +2391,15 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				UErrorCode	status;
 
 				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
+				result = current_icu_library->strcollUTF8(pg_icu_collator(current_icu_library,
+																		  sss->locale),
+														  a1p, len1,
+														  a2p, len2,
+														  &status);
 				if (U_FAILURE(status))
 					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
+							(errmsg("collation failed: %s",
+									current_icu_library->errorName(status))));
 			}
 			else
 #endif
@@ -2408,9 +2412,10 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
 				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
 
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
+				result = current_icu_library->strcoll(pg_icu_collator(current_icu_library,
+																	  sss->locale),
+													  uchar1, ulen1,
+													  uchar2, ulen2);
 
 				pfree(uchar1);
 				pfree(uchar2);
@@ -2574,21 +2579,23 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 					uiter_setUTF8(&iter, sss->buf1, len);
 					state[0] = state[1] = 0;	/* won't need that again */
 					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
+					bsize = current_icu_library->nextSortKeyPart(pg_icu_collator(current_icu_library,
+																				 sss->locale),
+																 &iter,
+																 state,
+																 (uint8_t *) sss->buf2,
+																 Min(sizeof(Datum), sss->buflen2),
+																 &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
 								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
+										current_icu_library->errorName(status))));
 				}
 				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+					bsize = current_icu_library->getSortKey(pg_icu_collator(current_icu_library,
+																			sss->locale),
+															uchar, ulen,
+															(uint8_t *) sss->buf2, sss->buflen2);
 			}
 			else
 #endif
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index b25bd0e583..bd257e9b4b 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -51,6 +51,7 @@
 #include "utils/guc.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
+#include "utils/pg_locale.h"
 #include "utils/pidfile.h"
 #include "utils/syscache.h"
 #include "utils/varlena.h"
@@ -164,6 +165,10 @@ InitPostmasterChild(void)
 
 	/* Request a signal if the postmaster dies, if possible. */
 	PostmasterDeathSignalInit();
+
+#ifdef USE_ICU
+	pg_icu_activate_major_version(-1);
+#endif
 }
 
 /*
@@ -208,6 +213,10 @@ InitStandaloneProcess(const char *argv0)
 
 	if (pkglib_path[0] == '\0')
 		get_pkglib_path(my_exec_path, pkglib_path);
+
+#ifdef USE_ICU
+	pg_icu_activate_major_version(-1);
+#endif
 }
 
 void
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ed6de7ca94..68bd181a02 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -214,6 +214,7 @@ static const char *const subdirs[] = {
 	"pg_wal/archive_status",
 	"pg_commit_ts",
 	"pg_dynshmem",
+	"pg_icu_lib",
 	"pg_notify",
 	"pg_serial",
 	"pg_snapshots",
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index e7385faef8..67a3dab0b1 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -74,6 +74,68 @@ extern struct lconv *PGLC_localeconv(void);
 
 extern void cache_locale_time(void);
 
+#ifdef USE_ICU
+
+/*
+ * We don't want to call into dlopen'd ICU libraries that are newer than the
+ * one we were compiled and linked against, just in case there is an
+ * incompatible API change.
+ */
+#define PG_MAX_ICU_MAJOR_VERSION U_ICU_VERSION_MAJOR_NUM
+
+/* An old ICU release that we know has the right API. */
+#define PG_MIN_ICU_MAJOR_VERSION 54
+
+/*
+ * In a couple of places we use an array of possible versions as a fast
+ * associative table, which isn't too big for now.
+ */
+#define PG_NUM_ICU_MAJOR_VERSIONS								\
+	(PG_MAX_ICU_MAJOR_VERSION - PG_MIN_ICU_MAJOR_VERSION + 1)
+#define PG_ICU_SLOT(major_version)					\
+	((major_version) - PG_MIN_ICU_MAJOR_VERSION)
+
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	void	   *handle;			/* if loaded with dlopen() */
+	int			major_version;	/* major version of ICU */
+	UCollator *(*open)(const char *loc, UErrorCode *status);
+	void (*close)(UCollator *coll);
+	void (*getVersion)(const UCollator *coll, UVersionInfo info);
+	void (*versionToString)(const UVersionInfo versionArray,
+							char *versionString);
+	UCollationResult (*strcoll)(const UCollator *coll,
+								const UChar *source,
+								int32_t sourceLength,
+								const UChar *target,
+								int32_t targetLength);
+	UCollationResult (*strcollUTF8)(const UCollator *coll,
+									const char *source,
+									int32_t sourceLength,
+									const char *target,
+									int32_t targetLength,
+									UErrorCode *status);
+	int32_t (*getSortKey)(const UCollator *coll,
+						  const UChar *source,
+						  int32_t sourceLength,
+						  uint8_t *result,
+						  int32_t resultLength);
+	int32_t (*nextSortKeyPart)(const UCollator *coll,
+							   UCharIterator *iter,
+							   uint32_t state[2],
+							   uint8_t *dest,
+							   int32_t count,
+							   UErrorCode *status);
+	const char *(*errorName)(UErrorCode code);
+} pg_icu_library;
+
+extern pg_icu_library *current_icu_library;
+
+#endif
 
 /*
  * We define our own wrapper around locale_t so we can keep the same
@@ -94,7 +156,7 @@ struct pg_locale_struct
 		struct
 		{
 			const char *locale;
-			UCollator  *ucol;
+			UCollator  *ucol[PG_NUM_ICU_MAJOR_VERSIONS];
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
@@ -103,6 +165,36 @@ struct pg_locale_struct
 
 typedef struct pg_locale_struct *pg_locale_t;
 
+#ifdef USE_ICU
+/*
+ * Get a collator for 'loc' suitable for use with ICU library 'lib'.
+ */
+static inline UCollator *
+pg_icu_collator(pg_icu_library *lib, pg_locale_t loc)
+{
+	int major_version = lib->major_version;
+	UCollator *collator = loc->info.icu.ucol[PG_ICU_SLOT(major_version)];
+
+	if (unlikely(!collator))
+	{
+		UErrorCode status = U_ZERO_ERROR;
+
+		collator = lib->open(loc->info.icu.locale, &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\", ICU major version %d: %s",
+							loc->info.icu.locale,
+							major_version,
+							lib->errorName(status))));
+		loc->info.icu.ucol[PG_ICU_SLOT(major_version)] = collator;
+	}
+
+	return collator;
+}
+
+extern void pg_icu_activate_major_version(int major_version);
+#endif
+
 extern PGDLLIMPORT struct pg_locale_struct default_locale;
 
 extern void make_icu_collator(const char *iculocstr,
-- 
2.36.0
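
(Editorial sketch, not part of the patch.) The ucol[] array and PG_ICU_SLOT() in the header above amount to a per-locale cache with one lazily filled slot per ICU major version. A minimal stand-alone illustration of that idea follows. It assumes a made-up maximum version of 71, and uses a fake_collator type and a fake_open() function in place of UCollator and lib->open(); all of those names are hypothetical.

```c
#include <stddef.h>

/* PG_MIN_ICU_MAJOR_VERSION is 54 in the patch; 71 is an assumed maximum. */
#define MIN_MAJOR 54
#define MAX_MAJOR 71
#define NUM_MAJORS (MAX_MAJOR - MIN_MAJOR + 1)
#define SLOT(major) ((major) - MIN_MAJOR)

/* Stand-in for UCollator: it only remembers which version opened it. */
typedef struct fake_collator
{
	int			major;
} fake_collator;

typedef struct locale_entry
{
	const char *name;
	fake_collator *cache[NUM_MAJORS];	/* NULL until first use */
	fake_collator storage[NUM_MAJORS];	/* backing store for the stand-ins */
} locale_entry;

/* Counts how often the (fake) library had to open a collator. */
static int	open_calls = 0;

/* Hypothetical stand-in for lib->open(). */
static fake_collator *
fake_open(locale_entry *loc, int major)
{
	fake_collator *c = &loc->storage[SLOT(major)];

	c->major = major;
	open_calls++;
	return c;
}

/* Return the cached collator for this version, opening it on first use. */
static fake_collator *
get_collator(locale_entry *loc, int major)
{
	fake_collator *c;

	if (major < MIN_MAJOR || major > MAX_MAJOR)
		return NULL;			/* outside the supported range */

	c = loc->cache[SLOT(major)];
	if (c == NULL)
	{
		c = fake_open(loc, major);
		loc->cache[SLOT(major)] = c;
	}
	return c;
}
```

Asking twice for the same version opens the collator only once, which is the property the array-as-associative-table comment in the header is after.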

#66

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#65)

Re: Collation version tracking for macOS

On Fri, Jun 10, 2022 at 6:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Executive summary of experiments so far: the "distinct collations"
concept is quite simple and robust, but exposes all the versions to
users and probably makes it really hard to upgrade (details not worked
out), while the "time travelling collations" concept is nice for users
but hard to pin down and prove correctness for since it seems to
require dynamic scoping/global state changes affecting code in far
away places.

It didn't really occur to me until now that the pg_dump problems that
come with the approach you outlined ("distinct collations") are likely
to be total blockers, and not just disadvantages. It's not just ICU
that prefers approximately correct behavior over throwing an "unknown
collation" error -- the same could be said for pg_dump itself. After
all, pg_dump doesn't care about collation versions -- except when run
in binary mode, for pg_upgrade, where it must satisfy the requirements
of pg_upgrade.

Even today we could be restoring to a server with an older ICU
version, where in general we might not get exactly the behavior the
user expects (though usually only when they've decided to use advanced
features like custom tailorings). So pg_dump already deliberately
disregards the ICU version, for essentially the same reasons that I
argued were good reasons upthread (when talking about a
multi-ICU-version Postgres via "time travelling collations").

Some more thoughts on "time travelling collations":

Doing a version switch in one atomic operation (like a special
REINDEX) isn't going to be practical. We need to be prepared for cases
where a database has a mix of indexes with old and new physical
collations. We certainly cannot allow queries to give wrong
answers...but I tend to doubt that (say) making merge joins work with
two indexes with different physical collations (though matching
logical collations) actually makes much sense. Maybe we can cut scope
in a pragmatic way instead.

The special REINDEX (or whatever) won't work as an atomic
operation...but that doesn't mean that the system as a whole will have
a mix of old and new physical collations forever, or even for very
long. So while everything still has to work correctly, mediocre
performance with certain kinds of plan shapes might be okay.

As you kind of said yourself yesterday, "time travelling collations"
could naturally have an invariant that worked at the index/constraint
level (not the collation level): any given index needs to use only one
underlying ICU version at the same time, no matter what. The invariant
could perhaps be leveraged in the planner -- perhaps we start out with
a working assumption that *every* index is on the newer ICU version
(when at least one index is on the latest and "now current" version),
and work backwards by excluding indexes that still have old physical
collations. Defining the problem as a problem with old
indexes/constraints only seems like it might make things a lot easier.
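
(Editorial sketch.) The "work backwards by excluding old indexes" idea could look roughly like the filter below. This is only an illustration under assumptions: the index_info struct, its icu_major field and both function names are hypothetical, and nothing here reflects how the planner actually represents indexes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-index record of the ICU version it was built with. */
typedef struct index_info
{
	const char *name;
	int			icu_major;
} index_info;

/*
 * An index's sort order is only trusted for plans that assume the current
 * collation if it was built with the database's current ICU version.
 */
static bool
index_order_usable(const index_info *idx, int current_major)
{
	return idx->icu_major == current_major;
}

/* Collect pointers to the usable indexes into out[]; returns how many. */
static size_t
usable_indexes(const index_info *all, size_t n, int current_major,
			   const index_info **out)
{
	size_t		count = 0;

	for (size_t i = 0; i < n; i++)
		if (index_order_usable(&all[i], current_major))
			out[count++] = &all[i];
	return count;
}
```

Indexes that fail the test would still have to answer lookups correctly using their own ICU version; the sketch only covers the "don't rely on their ordering" half.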

--
Peter Geoghegan

#67

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#63)

Re: Collation version tracking for macOS

On Thu, Jun 9, 2022 at 9:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Perhaps that could be modeled with a pg_depend row pointing to a
pg_icu_library row, which you'd probably need anyway, to prevent a
registered ICU library that is needed for a live index from being
dropped. (That's assuming that the pg_icu_library catalogue concept
has legs... well if we're going with dlopen(), we'll need *somewhere*
to store the shared object paths. Perhaps it's not a given that we
really want paths in a table... I guess it might prevent certain
cross-OS streaming rep scenarios, but mostly that'd be solvable with
symlinks...)

Do we even need to store a version for indexes most of the time if
we're versioning ICU itself, as part of the "time travelling
collations" design? For that matter, do we even need to version
collations directly anymore?

I'm pretty sure that the value of pg_collation.collversion is always
the same in practice, or has a lot of redundancy. Because mostly it's
just an ICU version. This is what I see on my system, at least:

pg@regression:5432 [53302]=# select count(*), collversion from
pg_collation where collprovider = 'icu' group by 2;
 count │ collversion
───────┼─────────────
   329 │ 153.112.41
   471 │ 153.112
(2 rows)

(Not sure why there are two different distinct collversion values
offhand, but generally looks like collversion isn't terribly
meaningful at the level of individual pg_collation entries.)

If indexes and constraints with old physical collations are defined as
being the exception to the general rule (the rule meaning "every index
uses the current ICU version for the database as a whole"), and if
those indexes/constraints are enumerated and stored (in a new system
catalog) when a switchover of the database's ICU version is first
initialized, then there might not be any meaningful dependency to
speak of. Not for indexes, at least.

The *database as a whole* is dependent on the current version of ICU
-- it's not any one index. Very occasionally the database will also be
dependent on a single older ICU version that we're still transitioning
away from. There is a "switch-a-roo" going on, but not really at the
level of indexes -- it's a very specialized thing, that works at the
level of the whole database, and involves exactly 2 ICU versions. You
should probably be able to back out of it once it begins, but mostly
it's an inflexible process that just does what we need it to do.
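
(Editorial sketch.) The database-level "switch-a-roo" with exactly two ICU versions can be pictured as a small state machine. The struct and function names below are hypothetical, and backing out of a transition is deliberately not modelled.

```c
#include <stdbool.h>

/* Hypothetical per-database ICU state. */
typedef struct db_icu_state
{
	int			current_major;		/* version new work uses */
	int			old_major;			/* 0 when no transition is in progress */
	int			old_index_count;	/* indexes still on old_major */
} db_icu_state;

/*
 * Start moving the database to new_major.  Refused while a transition is
 * already under way, so at most two versions are ever involved.
 */
static bool
begin_switch(db_icu_state *db, int new_major, int index_count)
{
	if (db->old_major != 0 || new_major == db->current_major)
		return false;

	db->old_major = db->current_major;
	db->current_major = new_major;
	db->old_index_count = index_count;
	if (index_count == 0)
		db->old_major = 0;		/* nothing to rebuild, done at once */
	return true;
}

/* Record that one old index has been rebuilt with the current version. */
static void
index_rebuilt(db_icu_state *db)
{
	if (db->old_major == 0)
		return;
	if (db->old_index_count > 0 && --db->old_index_count == 0)
		db->old_major = 0;		/* transition complete */
}
```

The dependency on the old library lives in one place (old_major) rather than in a pg_depend row per index, which is the simplification being argued for.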

Does something like that seem sensible to you?

--
Peter Geoghegan

#68

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Geoghegan (#66)

Re: Collation version tracking for macOS

On Sat, Jun 11, 2022 at 2:29 PM Peter Geoghegan <pg@bowt.ie> wrote:

The special REINDEX (or whatever) won't work as an atomic
operation...but that doesn't mean that the system as a whole will have
a mix of old and new physical collations forever, or even for very
long. So while everything still has to work correctly, mediocre
performance with certain kinds of plan shapes might be okay.

Yeah. And if you don't like the negative effects of a system in
transition, you could also create new otherwise identical indexes, and
then only drop the old ones once that's done, and add enough brains to
keep everything working nicely on the old indexes until enough of the
new indexes exist. Or something. I was thinking that could also be
true for the "distinct collations" concept, only with different
problems to solve...

As you kind of said yourself yesterday, "time travelling collations"
could naturally have an invariant that worked at the index/constraint
level (not the collation level): any given index needs to use only one
underlying ICU version at the same time, no matter what. The invariant
could perhaps be leveraged in the planner -- perhaps we start out with
a working assumption that *every* index is on the newer ICU version
(when at least one index is on the latest and "now current" version),
and work backwards by excluding indexes that still have old physical
collations. Defining the problem as a problem with old
indexes/constraints only seems like it might make things a lot easier.

Yes, that (posited) invariant was an enabling realisation for the
(unfinished, but IMHO useful to grok) v2 patch. The disabling
realisation that stopped me from finishing it was that I doubt my
ability to find all the right places to wrap with
"pg_icu_activate_major_version(X)" and
"pg_icu_activate_major_version(-1)", and thus the whole approach. Do
you know where to put all the ICU version switching regions, and how
to verify that they cover exactly all the right code, but don't leak
into any of the wrong code, and do you know where to store/retrieve X?
It feels... flimsy to me, but how else could you make collations
behave differently when being called on behalf of some index rather
than some other thing, given no other context? Explicitly passing a
pg_icu_library all over the place also sounds non-fun.
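
(Editorial sketch.) The dynamic-scoping worry is easier to see in miniature. Below, a global stands in for the state that pg_icu_activate_major_version() changes, and a wrapper switches it around one call and restores it afterwards. All names are hypothetical. In a real backend an error raised inside the callback would skip the restore unless the region were also guarded (for example with PG_TRY/PG_FINALLY), which is part of what makes the approach feel fragile.

```c
#include <string.h>

/* -1 means "use the ICU version we are linked against". */
static int	active_icu_major = -1;

/* Hypothetical stand-in for pg_icu_activate_major_version(). */
static void
activate_major_version(int major)
{
	active_icu_major = major;
}

typedef int (*compare_fn) (const char *a, const char *b);

/*
 * Run one comparison with a specific ICU version active, then put back
 * whatever was active before.  Code called from fn sees the switched
 * version; code outside this region must not.
 */
static int
compare_with_version(int major, compare_fn fn, const char *a, const char *b)
{
	int			saved = active_icu_major;
	int			result;

	activate_major_version(major);
	result = fn(a, b);
	activate_major_version(saved);
	return result;
}

/* Test helper: remembers which version was active when it was called. */
static int	observed_major = 0;

static int
recording_compare(const char *a, const char *b)
{
	observed_major = active_icu_major;
	return strcmp(a, b);
}
```

Every call site that acts on behalf of an old index would need such a region, and the value to pass as the version has to come from somewhere, which is the open question raised above.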

I'm also suspicious that there are more subtle hazards like pathkeys
lurking in the shadows. We go to great effort to recognise matching
and non-matching collations by OID alone, which is why my first
attempt was "distinct [OIDs]", so that'd keep working.

I wondered if DB2's support looked more like "time travel" or
"distinct". Based only on a quick glance at their manual[1], it looks
a bit like they have "I don't care" collations which are subject to
weirdness on upgrade when they change underneath your feet, and then
"distinct" collations which have an explicit prefix to pin down the
version (indirectly via CLDR version) and route to a specific library
(N copies of ICU that ship with it), with a note recommending the
latter for indexes and constraints. So I'd guess you'll stay on the
old versions forever until you explicitly migrate data to a new
collation.

[1]: https://www.ibm.com/docs/en/db2/11.1?topic=support-locale-names-sql-xquery

#69

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Geoghegan (#67)

Re: Collation version tracking for macOS

On Sat, Jun 11, 2022 at 3:36 PM Peter Geoghegan <pg@bowt.ie> wrote:

Do we even need to store a version for indexes most of the time if
we're versioning ICU itself, as part of the "time travelling
collations" design? For that matter, do we even need to version
collations directly anymore?

They're still useful for non-ICU collations (for example FreeBSD and
Windows can tell you about version changes based on open standards),
and they're *maybe* still useful for ICU, considering that there
are minor version upgrades, though I hope that would never actually
detect a change if we built a multi-version system like what we are
discussing here. Certainly they don't make sense in the current
catalog layout with TT collations, though, since there's only one attribute
to cover N libraries (though the reverted version tracking thing would
handle it just fine, because that moved it into a per-index location).

I mention minor upgrade as a topic to poke at because the popular
Linux distros only allow major ICU versions to be installed
concurrently, but minor versions are also released from time to time
and replace the libraries (well, the .68 library is a symlink to
.68.1, and then changes to .68.2, following typical conventions, but
the packages don't let you have .68.1 and .68.2 at the same time). To
pick a random example, ICU upgraded 68.1 -> 68.2 at one point, which a
bit of googling tells me included CLDR 38 -> CLDR 38.1. It looks like
they tweaked a few super minor things. Could such a change affect the
values that ucol_getVersion() reports? This came up in the last round
of this stuff with Doole[1], but we didn't dig further and I still
don't know what to think about it.

[1]: /messages/by-id/CADE5jYJTnYaTNXMFKOK-0p44+Dm5LMcRcJ5kVi1MVHomb2QTkQ@mail.gmail.com

#70

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#68)

Re: Collation version tracking for macOS

On Fri, Jun 10, 2022 at 8:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I'm also suspicious that there are more subtle hazards like pathkeys
lurking in the shadows. We go to great effort to recognise matching
and non-matching collations by OID alone, which is why my first
attempt was "distinct [OIDs]", so that'd keep working.

It's definitely possible that we won't be able to find a workable
solution that deals with "time travel collations" sensibly from the
planner's perspective. It's certainly not a neat adjunct to what we
have. I think that it *might* be possible to find a way to make it
work that is suboptimal, but works. Without being overly clever.

The DB2 docs say "use the CLDR version prefix to avoid unexpected
changes in behavior when upgrading to future releases of the Db2
database". But if you don't do that, and get a change in behavior,
then surely any related indexes must have been rebuilt too. The
interesting part may be what that upgrade looks like in detail.

--
Peter Geoghegan

#71

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#69)

Re: Collation version tracking for macOS

On Fri, Jun 10, 2022 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:

They're still useful for non-ICU collations (for example FreeBSD and
Windows can tell you about version changes based on open standards),
and they're *maybe* still useful for ICU, considering that there
are minor version upgrades, though I hope that would never actually
detect a change if we built a multi-version system like what we are
discussing here.

Right. I was mostly just asking this as a rhetorical question.

What about "time travel collations", but without the time travel part?
That is, what about supporting multiple ICU versions per cluster, but
not per database? So you could upgrade the OS and Postgres, using
standard packages that typically just use the latest ICU version --
typically, but not always. If you happen to have been on an older
version of ICU on upgrade, then that version of ICU will still work at
the level of a whole database -- your database. Maybe you can create
new databases with old and new ICU versions if you want to.

That obviously runs into the problem of needing to eventually do a
dump and reload -- but I suppose that "eventually" could be a very
long time. At least the OS package doesn't declare one version of ICU
the blessed version, now and forever, effectively vendoring ICU in a
backdoor fashion. At least old databases have significant runway,
while at the same time new databases that want to use the same
standard Postgres package aren't forced to use the same ancient ICU
version.

--
Peter Geoghegan

#72

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Peter Geoghegan (#71)

1 attachment(s)

Re: Collation version tracking for macOS

On Sat, Jun 11, 2022 at 4:21 PM Peter Geoghegan <pg@bowt.ie> wrote:

What about "time travel collations", but without the time travel part?
That is, what about supporting multiple ICU versions per cluster, but
not per database? So you could upgrade the OS and Postgres, using
standard packages that typically just use the latest ICU version --
typically, but not always. If you happen to have been on an older
version of ICU on upgrade, then that version of ICU will still work at
the level of a whole database -- your database. Maybe you can create
new databases with old and new ICU versions if you want to.

That obviously runs into the problem of needing to eventually do a
dump and reload -- but I suppose that "eventually" could be a very
long time. At least the OS package doesn't declare one version of ICU
the blessed version, now and forever, effectively vendoring ICU in a
backdoor fashion. At least old databases have significant runway,
while at the same time new databases that want to use the same
standard Postgres package aren't forced to use the same ancient ICU
version.

Hmm. I think that's effectively what you'd get using my "distinct
collation" patch (v1, or this much better v3, attached), if you put
version prefixes in colliculocale, and updated them in the template
database after an OS upgrade to affect new databases. I realise you
probably mean something a little more automatic...

I think "pinned forever" ICU versions would be useful, because I think
there are very few expert users who want fine generalised control over
version changes, but almost all other users don't care at all about
any of this stuff -- as long as their indexes keep indexing and their
constraints keep constraining. So I think you can make a lot of
people happy by ignoring the complexities of upgrades and providing a
way to nail the version down for the lifetime of the database. Also,
it's not *impossible* to move to a later ICU, it's just a bit tricky;
the key point is that it's under your control if you want to do that,
independently of an OS upgrade, as you said.

Based on my reading of that DB2 manual page, I reckon my v3 "distinct
collation" patch is about as good as what they have. If you don't
choose to use prefixes then later OS upgrades (ie upgrades that change
the version of ICU that PostgreSQL is linked against) might corrupt
your indexes and constraints -- I think that's what they're saying --
though at least we'll try to warn about that with our weak warning
system. If you do choose to use prefixes you'll be stuck on that ICU
version forever, even across updates that cause PostgreSQL to be
linked to future releases of ICU, unless you're prepared to do a whole
bunch of careful upgrading work (perhaps with some future tooling to
help with that).
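
(Editorial sketch.) The "weak warning system" boils down to comparing the version string recorded when the collation was created with the one the library reports now. A stand-alone approximation follows; the function names and the exact warning wording are assumptions, and the real check lives inside PostgreSQL rather than in code like this.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if both versions are known and they differ. */
static bool
collation_version_changed(const char *recorded, const char *actual)
{
	if (recorded == NULL || actual == NULL)
		return false;			/* nothing to compare against */
	return strcmp(recorded, actual) != 0;
}

/*
 * Warn (but carry on) when the recorded and actual versions differ.
 * Returns true when the versions agree or cannot be compared.
 */
static bool
check_collation_version(const char *collname,
						const char *recorded, const char *actual)
{
	if (!collation_version_changed(recorded, actual))
		return true;

	fprintf(stderr,
			"WARNING: collation \"%s\" has version mismatch "
			"(recorded %s, actual %s)\n",
			collname, recorded, actual);
	return false;
}
```

It is only a warning: nothing stops queries from using indexes built under the old version, which is why pinning the version with a prefix is the safer option.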

Attached is a much more fleshed out version of the "distinct
collation" patch (to be clear: v3 descends from v1, while v2 was an
attempt at the timelord approach). Main changes:

1. I now also route strTo{Upper,Lower,Title} to the right version of
ICU. That requires dlopen'ing a second library.

2. You create distinct collations with optional ICU major version
prefixes, along the lines of what Jim was suggesting. For example
(and I'm not actually proposing -x-icu67 suffixes, just avoiding a
collision in my example):

postgres=# create collation "en-x-icu67" (provider = icu , locale = '67:en');
CREATE COLLATION
postgres=# create collation "en-x-icu63" (provider = icu , locale = '63:en');
CREATE COLLATION

3. For versions other than the one we are linked against, it tries to
open libraries with typical names inside $libdir. An administrator
could drop symlinks in there like so:

$ ln -s /usr/lib/x86_64-linux-gnu/libicui18n.so.63 ~/install/lib/postgresql/
$ ln -s /usr/lib/x86_64-linux-gnu/libicuuc.so.63 ~/install/lib/postgresql/
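
(Editorial sketch.) The optional major-version prefix from item 2, as in '67:en', can be split off with a few lines of C. This is not the code from the patch; split_version_prefix() is a hypothetical helper, and the upper bound of 999 is an arbitrary sanity limit.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * If 'locale' starts with "<digits>:", store the number in *major and
 * return a pointer to the rest.  Otherwise store -1 (meaning "no
 * version requested") and return 'locale' unchanged.
 */
static const char *
split_version_prefix(const char *locale, int *major)
{
	char	   *end;
	long		v;

	*major = -1;
	if (!isdigit((unsigned char) locale[0]))
		return locale;

	v = strtol(locale, &end, 10);
	if (*end != ':' || v <= 0 || v > 999)
		return locale;			/* digits, but not a version prefix */

	*major = (int) v;
	return end + 1;
}
```

A caller would then range-check the result against the minimum and maximum supported major versions before trying to load anything.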

What I like about this $libdir scheme is that I imagine that we could
ask our friends in the packaging teams to create packages for that.
Users would then think of them in much the same way as extensions.
You'd just type:

$ sudo apt-get install postgresql-16-icu71
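
(Editorial sketch.) Finding a library by "typical name inside $libdir" comes down to formatting a path from the directory, the library name and the major version. The helper below is hypothetical and assumes the Linux-style lib<name>.so.<major> convention used in the symlink example; other platforms name shared libraries differently.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Build "<libdir>/lib<libname>.so.<major>" into buf.  Returns false if
 * the result does not fit.
 */
static bool
icu_library_path(char *buf, size_t buflen,
				 const char *libdir, const char *libname, int major)
{
	int			n = snprintf(buf, buflen, "%s/lib%s.so.%d",
							 libdir, libname, major);

	return n > 0 && (size_t) n < buflen;
}
```

The resulting string is what would be handed to dlopen(), once for the i18n library and once for the uc library.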

Happy to keep trying to figure out the competing and rather more
ambitious TT version too (that I sketched some bits of in v2), but I'm
a whole lot fuzzier on how that can work and kinda stuck on the
problems I raised.

Attachments:

v3-0001-WIP-multi-ICU-distinct-collations.patch (text/x-patch; charset=US-ASCII)

From bf748af1b2576c232bdec633fb8466092fa7ce92 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v3] WIP: multi ICU, distinct collations

XXX This is highly experimental code
---
 src/backend/access/hash/hashfunc.c |  16 +-
 src/backend/utils/adt/formatting.c |  49 +++-
 src/backend/utils/adt/pg_locale.c  | 355 ++++++++++++++++++++++++++++-
 src/backend/utils/adt/varchar.c    |  16 +-
 src/backend/utils/adt/varlena.c    |  54 ++---
 src/include/utils/pg_locale.h      |  69 ++++++
 src/tools/pgindent/typedefs.list   |   3 +
 7 files changed, 498 insertions(+), 64 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..0a61538efd 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,11 @@ hashtext(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -355,11 +355,11 @@ hashtextextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index e909c1a200..02c8cc5b16 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1602,6 +1602,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1630,14 +1635,35 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->info.icu.locale, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->info.icu.locale, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s", u_errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1705,7 +1731,8 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToLower,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1827,7 +1854,8 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToUpper,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1950,8 +1978,9 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
+			len_conv = icu_convert_case_bi(PG_ICU_LIB(mylocale)->strToTitle,
+										   mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index a0490a7522..d808fbb205 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -58,6 +58,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_control.h"
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/formatting.h"
 #include "utils/hsearch.h"
@@ -68,6 +69,7 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -78,8 +80,27 @@
 #include <shlwapi.h>
 #endif
 
+#ifdef HAVE_DLOPEN
+#include <dlfcn.h>
+#endif
+
 #define		MAX_L10N_DATA		80
 
+#ifdef USE_ICU
+
+/*
+ * We don't want to call into dlopen'd ICU libraries that are newer than the
+ * one we were compiled and linked against, just in case there is an
+ * incompatible API change.
+ */
+#define PG_MAX_ICU_MAJOR_VERSION U_ICU_VERSION_MAJOR_NUM
+
+/* An old ICU release that we know has the right API. */
+/* XXX check if this is actually true, and if we could go back further */
+#define PG_MIN_ICU_MAJOR_VERSION 54
+
+#endif
+
 
 /* GUC settings */
 char	   *locale_messages;
@@ -1435,29 +1456,332 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
+#ifdef USE_ICU
+
 struct pg_locale_struct default_locale;
 
+/* Linked list of ICU libraries we have loaded. */
+static pg_icu_library *icu_library_list = NULL;
+
+/*
+ * Free an ICU library.  pg_icu_library objects that are successfully
+ * constructed stick around for the lifetime of the backend, but this is used
+ * to clean up if initialization fails.
+ */
+static void
+free_icu_library(pg_icu_library *lib)
+{
+	if (lib->libicui18n_handle)
+		dlclose(lib->libicui18n_handle);
+	if (lib->libicuuc_handle)
+		dlclose(lib->libicuuc_handle);
+	pfree(lib);
+}
+
+static void *
+get_icu_function(void *handle, const char *function, int version)
+{
+	char		name[80];
+
+	snprintf(name, sizeof(name), "%s_%d", function, version);
+
+	return dlsym(handle, name);
+}
+
+/*
+ * Probe a dynamically loaded library to see which major version of ICU it
+ * contains.
+ */
+static int
+get_icu_library_major_version(void *handle)
+{
+	for (int i = PG_MIN_ICU_MAJOR_VERSION; i <= PG_MAX_ICU_MAJOR_VERSION; ++i)
+		if (get_icu_function(handle, "ucol_open", i) ||
+			get_icu_function(handle, "u_strToUpper", i))
+			return i;
+
+	/*
+	 * It's a later version we don't dare use, an old version we don't
+	 * support, an ICU build with symbol suffixes disabled, or not ICU.
+	 */
+	return -1;
+}
+
+/*
+ * We have to load a couple of different libraries, so we'll reuse the code to
+ * do that.
+ */
+static void *
+load_icu_library(pg_icu_library *lib, const char *name)
+{
+	void	   *handle;
+	int			found_major_version;
+
+	handle = dlopen(name, RTLD_NOW | RTLD_GLOBAL);
+	if (handle == NULL)
+	{
+		int			errno_save = errno;
+
+		free_icu_library(lib);
+		errno = errno_save;
+
+		ereport(ERROR,
+				(errmsg("could not load library \"%s\": %m", name)));
+	}
+
+	found_major_version = get_icu_library_major_version(handle);
+	if (found_major_version < 0)
+	{
+		free_icu_library(lib);
+		ereport(ERROR,
+				(errmsg("could not find compatible ICU major version in library \"%s\"",
+						name)));
+	}
+
+	if (found_major_version != lib->major_version)
+	{
+		free_icu_library(lib);
+		ereport(ERROR,
+				(errmsg("expected to find ICU major version %d in library \"%s\", but found %d",
+						lib->major_version, name, found_major_version)));
+	}
+
+	return handle;
+}
+
+/*
+ * Given an ICU major version number, return the object we need to access it,
+ * or fail while trying to load it.
+ */
+static pg_icu_library *
+get_icu_library(int major_version)
+{
+	pg_icu_library *lib;
+
+	Assert(major_version >= PG_MIN_ICU_MAJOR_VERSION &&
+		   major_version <= PG_MAX_ICU_MAJOR_VERSION);
+
+	/* Try to find it in our list of existing libraries. */
+	for (lib = icu_library_list; lib; lib = lib->next)
+		if (lib->major_version == major_version)
+			return lib;
+
+	/* Make a new entry. */
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(*lib));
+	if (major_version == U_ICU_VERSION_MAJOR_NUM)
+	{
+		/*
+		 * This is the version we were compiled and linked against.  Simply
+		 * assign the function pointers.
+		 *
+		 * These assignments will fail to compile if an incompatible API
+		 * change is made to some future version of ICU, at which point we
+		 * might need to consider special treatment for different major
+		 * version ranges, with intermediate trampoline functions.
+		 */
+		lib->major_version = major_version;
+		lib->open = ucol_open;
+		lib->close = ucol_close;
+		lib->getVersion = ucol_getVersion;
+		lib->versionToString = u_versionToString;
+		lib->strcoll = ucol_strcoll;
+		lib->strcollUTF8 = ucol_strcollUTF8;
+		lib->getSortKey = ucol_getSortKey;
+		lib->nextSortKeyPart = ucol_nextSortKeyPart;
+		lib->errorName = u_errorName;
+		lib->strToUpper = u_strToUpper;
+		lib->strToLower = u_strToLower;
+		lib->strToTitle = u_strToTitle;
+
+		/*
+		 * Also assert the size of a couple of types used as output buffers,
+		 * as a canary to tell us to add extra padding in the (unlikely) event
+		 * that a later release makes these values smaller.
+		 */
+		StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+						 "u_versionToString output buffer size changed incompatibly");
+		StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+						 "ucol_getVersion output buffer size changed incompatibly");
+	}
+	else
+	{
+		/* This is an older version, so we'll need to use dlopen(). */
+#ifdef HAVE_DLOPEN
+		char		libicui18n_name[MAXPGPATH];
+		char		libicuuc_name[MAXPGPATH];
+
+		/*
+		 * We don't like to open versions newer than what we're linked
+		 * against, to reduce the risk of an API change biting us.
+		 */
+		if (major_version > U_ICU_VERSION_MAJOR_NUM)
+			elog(ERROR, "ICU major version %d higher than linked version %d, refusing to open",
+				 major_version, U_ICU_VERSION_MAJOR_NUM);
+
+		lib->major_version = major_version;
+
+		/*
+		 * XXX using pkglib_path to look for libraries.  The idea is that an
+		 * administrator would drop symlinks in there.
+		 *
+		 * Perhaps should do something like dfmgr.c's dynamic_library_path
+		 * search to find it?
+		 */
+
+		/* Load the collation library. */
+		snprintf(libicui18n_name,
+				 sizeof(libicui18n_name),
+				 "%s/libicui18n" DLSUFFIX ".%d",
+				 pkglib_path,
+				 major_version);
+		lib->libicui18n_handle = load_icu_library(lib, libicui18n_name);
+
+		/* Load the ctype library. */
+		snprintf(libicuuc_name,
+				 sizeof(libicuuc_name),
+				 "%s/libicuuc" DLSUFFIX ".%d",
+				 pkglib_path,
+				 major_version);
+		lib->libicuuc_handle = load_icu_library(lib, libicuuc_name);
+
+		/* Look up all the functions we need. */
+		lib->open = get_icu_function(lib->libicui18n_handle,
+									 "ucol_open",
+									 major_version);
+		lib->close = get_icu_function(lib->libicui18n_handle,
+									  "ucol_close",
+									  major_version);
+		lib->getVersion = get_icu_function(lib->libicui18n_handle,
+										   "ucol_getVersion",
+										   major_version);
+		lib->versionToString = get_icu_function(lib->libicui18n_handle,
+												"u_versionToString",
+												major_version);
+		lib->strcoll = get_icu_function(lib->libicui18n_handle,
+										"ucol_strcoll",
+										major_version);
+		lib->strcollUTF8 = get_icu_function(lib->libicui18n_handle,
+											"ucol_strcollUTF8",
+											major_version);
+		lib->getSortKey = get_icu_function(lib->libicui18n_handle,
+										   "ucol_getSortKey",
+										   major_version);
+		lib->nextSortKeyPart = get_icu_function(lib->libicui18n_handle,
+												"ucol_nextSortKeyPart",
+												major_version);
+		lib->errorName = get_icu_function(lib->libicui18n_handle,
+										  "u_errorName",
+										  major_version);
+		lib->strToUpper = get_icu_function(lib->libicuuc_handle,
+										   "u_strToUpper",
+										   major_version);
+		lib->strToLower = get_icu_function(lib->libicuuc_handle,
+										   "u_strToLower",
+										   major_version);
+		lib->strToTitle = get_icu_function(lib->libicuuc_handle,
+										   "u_strToTitle",
+										   major_version);
+		if (!lib->open ||
+			!lib->close ||
+			!lib->getVersion ||
+			!lib->versionToString ||
+			!lib->strcoll ||
+			!lib->strcollUTF8 ||
+			!lib->getSortKey ||
+			!lib->nextSortKeyPart ||
+			!lib->errorName ||
+			!lib->strToUpper ||
+			!lib->strToLower ||
+			!lib->strToTitle)
+		{
+			free_icu_library(lib);
+			ereport(ERROR,
+					(errmsg("could not find expected symbols in library \"%s\"",
+							libicui18n_name)));
+		}
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("dynamically loaded ICU libraries are not supported in this build")));
+#endif
+	}
+
+	lib->next = icu_library_list;
+	icu_library_list = lib;
+
+	return lib;
+}
+
+/*
+ * Look up the library to use for a given collcollate string.
+ */
+static pg_icu_library *
+get_icu_library_for_collation(const char *collcollate, const char **rest)
+{
+	int			major_version;
+	char	   *separator;
+	char	   *after_prefix;
+
+	separator = strchr(collcollate, ':');
+
+	/*
+	 * If it's a traditional value without a prefix, use the library we are
+	 * linked against.
+	 */
+	if (separator == NULL)
+	{
+		*rest = collcollate;
+		return get_icu_library(U_ICU_VERSION_MAJOR_NUM);
+	}
+
+	/* If it has a prefix, interpret it as an ICU major version. */
+	major_version = strtol(collcollate, &after_prefix, 10);
+	if (after_prefix != separator)
+		elog(ERROR,
+			 "could not parse ICU major library version: \"%s\"",
+			 collcollate);
+	if (major_version < PG_MIN_ICU_MAJOR_VERSION ||
+		major_version > PG_MAX_ICU_MAJOR_VERSION)
+		elog(ERROR,
+			 "ICU major library version out of supported range: \"%s\"",
+			 collcollate);
+
+	/* The part after the separator will be passed to the library. */
+	*rest = separator + 1;
+
+	return get_icu_library(major_version);
+}
+
+#endif
+
 void
 make_icu_collator(const char *iculocstr,
 				  struct pg_locale_struct *resultp)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_icu_library_for_collation(iculocstr, &iculocstr);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
+	collator = lib->open(iculocstr, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+						iculocstr, lib->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	/*
+	 * XXX can we just drop this cruft and make 54 the minimum supported
+	 * version?
+	 */
+	if (lib->major_version < 54)
 		icu_set_collation_attributes(collator, iculocstr);
 
 	/* We will leak this string if the caller errors later :-( */
 	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
 	resultp->info.icu.ucol = collator;
+	resultp->info.icu.lib = lib;
 #else							/* not USE_ICU */
 	/* could get here if a collation was created by a build with ICU */
 	ereport(ERROR,
@@ -1688,21 +2012,23 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 #ifdef USE_ICU
 	if (collprovider == COLLPROVIDER_ICU)
 	{
+		pg_icu_library *l;
 		UCollator  *collator;
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
+		l = get_icu_library_for_collation(collcollate, &collcollate);
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = l->open(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, l->errorName(status))));
+		l->getVersion(collator, versioninfo);
+		l->close(collator);
 
-		u_versionToString(versioninfo, buf);
+		l->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -1770,6 +2096,8 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 
 
 #ifdef USE_ICU
+
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -1991,19 +2319,22 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_icu_library_for_collation(icu_locale, &icu_locale);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = lib->open(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, lib->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	/* XXX can we just drop this cruft? */
+	if (lib->major_version < 54)
 		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+	lib->close(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index bbeb0a2653..187f066881 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1025,11 +1025,11 @@ hashbpchar(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, keydata, keylen);
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -1086,11 +1086,11 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 919138eaf3..e81efbeb73 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1666,13 +1666,14 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					UErrorCode	status;
 
 					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
+					result = PG_ICU_LIB(mylocale)->strcollUTF8(PG_ICU_COL(mylocale),
+															   arg1, len1,
+															   arg2, len2,
+															   &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
+								(errmsg("collation failed: %s",
+										PG_ICU_LIB(mylocale)->errorName(status))));
 				}
 				else
 #endif
@@ -1685,9 +1686,9 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
 					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
 
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
+					result = PG_ICU_LIB(mylocale)->strcoll(PG_ICU_COL(mylocale),
+														   uchar1, ulen1,
+														   uchar2, ulen2);
 
 					pfree(uchar1);
 					pfree(uchar2);
@@ -2389,13 +2390,14 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				UErrorCode	status;
 
 				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
+				result = PG_ICU_LIB(sss->locale)->strcollUTF8(PG_ICU_COL(sss->locale),
+															  a1p, len1,
+															  a2p, len2,
+															  &status);
 				if (U_FAILURE(status))
 					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
+							(errmsg("collation failed: %s",
+									PG_ICU_LIB(sss->locale)->errorName(status))));
 			}
 			else
 #endif
@@ -2408,9 +2410,9 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
 				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
 
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
+				result = PG_ICU_LIB(sss->locale)->strcoll(PG_ICU_COL(sss->locale),
+														  uchar1, ulen1,
+														  uchar2, ulen2);
 
 				pfree(uchar1);
 				pfree(uchar2);
@@ -2574,21 +2576,21 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 					uiter_setUTF8(&iter, sss->buf1, len);
 					state[0] = state[1] = 0;	/* won't need that again */
 					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
+					bsize = PG_ICU_LIB(sss->locale)->nextSortKeyPart(PG_ICU_COL(sss->locale),
+																	 &iter,
+																	 state,
+																	 (uint8_t *) sss->buf2,
+																	 Min(sizeof(Datum), sss->buflen2),
+																	 &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
 								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
+										PG_ICU_LIB(sss->locale)->errorName(status))));
 				}
 				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+					bsize = PG_ICU_LIB(sss->locale)->getSortKey(PG_ICU_COL(sss->locale),
+																uchar, ulen,
+																(uint8_t *) sss->buf2, sss->buflen2);
 			}
 			else
 #endif
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index e7385faef8..d51a3026f5 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -17,6 +17,7 @@
 #endif
 #ifdef USE_ICU
 #include <unicode/ucol.h>
+#include <unicode/ubrk.h>
 #endif
 
 #include "utils/guc.h"
@@ -74,6 +75,68 @@ extern struct lconv *PGLC_localeconv(void);
 
 extern void cache_locale_time(void);
 
+#ifdef USE_ICU
+
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	void	   *libicui18n_handle;
+	void	   *libicuuc_handle;
+	UCollator  *(*open) (const char *loc, UErrorCode *status);
+	void		(*close) (UCollator *coll);
+	void		(*getVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+				UCollationResult(*strcoll) (const UCollator *coll,
+											const UChar *source,
+											int32_t sourceLength,
+											const UChar *target,
+											int32_t targetLength);
+				UCollationResult(*strcollUTF8) (const UCollator *coll,
+												const char *source,
+												int32_t sourceLength,
+												const char *target,
+												int32_t targetLength,
+												UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	struct pg_icu_library *next;
+} pg_icu_library;
+
+#endif
 
 /*
  * We define our own wrapper around locale_t so we can keep the same
@@ -95,12 +158,18 @@ struct pg_locale_struct
 		{
 			const char *locale;
 			UCollator  *ucol;
+			pg_icu_library *lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
 	}			info;
 };
 
+#ifdef USE_ICU
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+#define PG_ICU_COL(x) ((x)->info.icu.ucol)
+#endif
+
 typedef struct pg_locale_struct *pg_locale_t;
 
 extern PGDLLIMPORT struct pg_locale_struct default_locale;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4fb746930a..47aba7654b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1100,6 +1100,7 @@ HeapTupleTableSlot
 HistControl
 HotStandbyState
 I32
+ICU_Convert_BI_Func
 ICU_Convert_Func
 ID
 INFIX
@@ -2858,6 +2859,7 @@ TypeName
 U
 U32
 U8
+UBreakIterator
 UChar
 UCharIterator
 UColAttribute
@@ -3485,6 +3487,7 @@ pg_funcptr_t
 pg_gssinfo
 pg_hmac_ctx
 pg_hmac_errno
+pg_icu_library
 pg_int64
 pg_local_to_utf_combined
 pg_locale_t
-- 
2.30.2

#73

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Thomas Munro (#72)

1 attachment(s)

Re: Collation version tracking for macOS

On Sun, Jun 12, 2022 at 11:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Jun 11, 2022 at 4:21 PM Peter Geoghegan <pg@bowt.ie> wrote:

What about "time travel collations", but without the time travel part?
That is, what about supporting multiple ICU versions per cluster, but
not per database? So you could upgrade the OS and Postgres, using
standard packages that typically just use the latest ICU version --
typically, but not always. If you happen to have been on an older
version of ICU on upgrade, then that version of ICU will still work at
the level of a whole database -- your database. Maybe you can create
new databases with old and new ICU versions if you want to.

That obviously runs into the problem of needing to eventually do a
dump and reload -- but I suppose that "eventually" could be a very
long time. At least the OS package doesn't declare one version of ICU
the blessed version, now and forever, effectively vendoring ICU in a
backdoor fashion. At least old databases have significant runway,
while at the same time new databases that want to use the same
standard Postgres package aren't forced to use the same ancient ICU
version.

Hmm. I think that's effectively what you'd get using my "distinct
collation" patch (v1, or this much better v3, attached), if you put
version prefixes in colliculocale, and updated them in the template
database after an OS upgrade to affect new databases. I realise you
probably mean something a little more automatic...

Thinking some more about what you said above: really, most people only
care about the default collation. I'm not yet sure what I think
initdb should put into pg_collation when importing the initial set of
collation objects in the "distinct" world (perhaps an un-prefixed and
a prefixed variant of each, with names ending -x-icu and -x-icu63?),
but as for the default collation, I should point out that the
"distinct" patch already gives you a nailed-to-the-ground database
approximately as you described above if you just do something like
this:

postgres=# create database db2 locale_provider = icu icu_locale =
'67:en' template = template0 ...;
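
To make the prefix convention concrete: a locale string such as '67:en'
splits at the first colon into an ICU major version and the locale name
that is handed to that library, while a string with no colon means "use
the ICU we are linked against". The following is a standalone sketch of
that split, not the patch's code. The name split_icu_locale and its
return conventions (0 for "no prefix", -1 for "bad prefix") are made up
for illustration, and the range check against the supported ICU versions
is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: split "67:en" into major version 67 and rest "en".
 *
 * Returns 0 if there is no prefix (the caller would use the linked ICU
 * version), the major version if a numeric prefix is present, or -1 if
 * the prefix is not a plain positive decimal number.  *rest is only set
 * when the return value is not -1.
 */
int
split_icu_locale(const char *locale, const char **rest)
{
	const char *separator;
	char	   *end;
	long		version;

	assert(locale != NULL && rest != NULL);

	separator = strchr(locale, ':');
	if (separator == NULL)
	{
		/* Traditional value without a prefix. */
		*rest = locale;
		return 0;
	}

	/* Everything before the colon must be consumed as a number. */
	version = strtol(locale, &end, 10);
	if (end != separator || end == locale || version <= 0 || version > 999)
		return -1;

	*rest = separator + 1;
	return (int) version;
}
```

With this, "67:en" yields 67 with rest pointing at "en", and "en-US"
yields 0 with rest pointing at the whole string.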

Small bugfix attached (v3 was accidentally calling uiter_setUTF8() and
u_errorName() directly in a couple of places).
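
For anyone wondering why a direct call matters: ICU's headers normally
rename each public function to a version-suffixed symbol (u_errorName
becomes u_errorName_<major>), so a direct call in our source always binds
to the ICU we were linked against, whatever library the collation asked
for. The patch instead looks up the suffixed name with dlsym(). Below is
a minimal sketch of just the name construction. The helper name and its
return convention are illustrative assumptions, and it needs no ICU
library to run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "<function>_<major_version>" into buf,
 * e.g. "ucol_open_67".  Returns 1 on success, or 0 if the function name
 * is empty or the result did not fit in the buffer.
 */
int
versioned_symbol_name(char *buf, size_t buflen,
					  const char *function, int major_version)
{
	int			n;

	assert(buf != NULL && function != NULL && buflen > 0);

	if (strlen(function) == 0)
		return 0;

	/* snprintf reports the length it wanted, so truncation is detectable. */
	n = snprintf(buf, buflen, "%s_%d", function, major_version);
	return n >= 0 && (size_t) n < buflen;
}
```

The resulting string is what one would pass to dlsym() on the handle of
the matching library; a NULL result there would suggest a different major
version, or an ICU build with symbol renaming disabled.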

Attachments:

v4-0001-WIP-multi-ICU-distinct-collations.patch (text/x-patch; charset=US-ASCII)

From 265d6ea0d292255c164015814f4e810186d2ffb3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v4] WIP: multi ICU, distinct collations

XXX This is highly experimental code
---
 src/backend/access/hash/hashfunc.c |  16 +-
 src/backend/utils/adt/formatting.c |  53 ++++-
 src/backend/utils/adt/pg_locale.c  | 360 ++++++++++++++++++++++++++++-
 src/backend/utils/adt/varchar.c    |  16 +-
 src/backend/utils/adt/varlena.c    |  56 ++---
 src/include/utils/pg_locale.h      |  72 ++++++
 src/tools/pgindent/typedefs.list   |   3 +
 7 files changed, 510 insertions(+), 66 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..0a61538efd 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,11 @@ hashtext(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -355,11 +355,11 @@ hashtextextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index e909c1a200..97c96cb7f6 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1602,6 +1602,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1626,18 +1631,41 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						PG_ICU_LIB(mylocale)->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->info.icu.locale, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->info.icu.locale, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						PG_ICU_LIB(mylocale)->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1705,7 +1733,8 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToLower,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1827,7 +1856,8 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToUpper,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1950,8 +1980,9 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
+			len_conv = icu_convert_case_bi(PG_ICU_LIB(mylocale)->strToTitle,
+										   mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index a0490a7522..8dec0ff8c9 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -58,6 +58,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_control.h"
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/formatting.h"
 #include "utils/hsearch.h"
@@ -68,6 +69,7 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -78,8 +80,27 @@
 #include <shlwapi.h>
 #endif
 
+#ifdef HAVE_DLOPEN
+#include <dlfcn.h>
+#endif
+
 #define		MAX_L10N_DATA		80
 
+#ifdef USE_ICU
+
+/*
+ * We don't want to call into dlopen'd ICU libraries that are newer than the
+ * one we were compiled and linked against, just in case there is an
+ * incompatible API change.
+ */
+#define PG_MAX_ICU_MAJOR_VERSION U_ICU_VERSION_MAJOR_NUM
+
+/* An old ICU release that we know has the right API. */
+/* XXX check if this is actually true, and if we could go back further */
+#define PG_MIN_ICU_MAJOR_VERSION 54
+
+#endif
+
 
 /* GUC settings */
 char	   *locale_messages;
@@ -1435,29 +1456,337 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
+#ifdef USE_ICU
+
 struct pg_locale_struct default_locale;
 
+/* Linked list of ICU libraries we have loaded. */
+static pg_icu_library *icu_library_list = NULL;
+
+/*
+ * Free an ICU library.  pg_icu_library objects that are successfully
+ * constructed stick around for the lifetime of the backend, but this is used
+ * to clean up if initialization fails.
+ */
+static void
+free_icu_library(pg_icu_library *lib)
+{
+	if (lib->libicui18n_handle)
+		dlclose(lib->libicui18n_handle);
+	if (lib->libicuuc_handle)
+		dlclose(lib->libicuuc_handle);
+	pfree(lib);
+}
+
+static void *
+get_icu_function(void *handle, const char *function, int version)
+{
+	char		name[80];
+
+	snprintf(name, sizeof(name), "%s_%d", function, version);
+
+	return dlsym(handle, name);
+}
+
+/*
+ * Probe a dynamically loaded library to see which major version of ICU it
+ * contains.
+ */
+static int
+get_icu_library_major_version(void *handle)
+{
+	for (int i = PG_MIN_ICU_MAJOR_VERSION; i <= PG_MAX_ICU_MAJOR_VERSION; ++i)
+		if (get_icu_function(handle, "ucol_open", i) ||
+			get_icu_function(handle, "u_strToUpper", i))
+			return i;
+
+	/*
+	 * It's a later version we don't dare use, an old version we don't
+	 * support, an ICU build with symbol suffixes disabled, or not ICU.
+	 */
+	return -1;
+}
+
+/*
+ * We have to load a couple of different libraries, so we'll reuse the code to
+ * do that.
+ */
+static void *
+load_icu_library(pg_icu_library *lib, const char *name)
+{
+	void	   *handle;
+	int			found_major_version;
+
+	handle = dlopen(name, RTLD_NOW | RTLD_GLOBAL);
+	if (handle == NULL)
+	{
+		int			errno_save = errno;
+
+		free_icu_library(lib);
+		errno = errno_save;
+
+		ereport(ERROR,
+				(errmsg("could not load library \"%s\": %m", name)));
+	}
+
+	found_major_version = get_icu_library_major_version(handle);
+	if (found_major_version < 0)
+	{
+		free_icu_library(lib);
+		ereport(ERROR,
+				(errmsg("could not find compatible ICU major version in library \"%s\"",
+						name)));
+	}
+
+	if (found_major_version != lib->major_version)
+	{
+		free_icu_library(lib);
+		ereport(ERROR,
+				(errmsg("expected to find ICU major version %d in library \"%s\", but found %d",
+						lib->major_version, name, found_major_version)));
+	}
+
+	return handle;
+}
+
+/*
+ * Given an ICU major version number, return the object we need to access it,
+ * or fail while trying to load it.
+ */
+static pg_icu_library *
+get_icu_library(int major_version)
+{
+	pg_icu_library *lib;
+
+	Assert(major_version >= PG_MIN_ICU_MAJOR_VERSION &&
+		   major_version <= PG_MAX_ICU_MAJOR_VERSION);
+
+	/* Try to find it in our list of existing libraries. */
+	for (lib = icu_library_list; lib; lib = lib->next)
+		if (lib->major_version == major_version)
+			return lib;
+
+	/* Make a new entry. */
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(*lib));
+	if (major_version == U_ICU_VERSION_MAJOR_NUM)
+	{
+		/*
+		 * This is the version we were compiled and linked against.  Simply
+		 * assign the function pointers.
+		 *
+		 * These assignments will fail to compile if an incompatible API
+		 * change is made to some future version of ICU, at which point we
+		 * might need to consider special treatment for different major
+		 * version ranges, with intermediate trampoline functions.
+		 */
+		lib->major_version = major_version;
+		lib->open = ucol_open;
+		lib->close = ucol_close;
+		lib->getVersion = ucol_getVersion;
+		lib->versionToString = u_versionToString;
+		lib->strcoll = ucol_strcoll;
+		lib->strcollUTF8 = ucol_strcollUTF8;
+		lib->getSortKey = ucol_getSortKey;
+		lib->nextSortKeyPart = ucol_nextSortKeyPart;
+		lib->setUTF8 = uiter_setUTF8;
+		lib->errorName = u_errorName;
+		lib->strToUpper = u_strToUpper;
+		lib->strToLower = u_strToLower;
+		lib->strToTitle = u_strToTitle;
+
+		/*
+		 * Also assert the size of a couple of types used as output buffers,
+		 * as a canary to tell us to add extra padding in the (unlikely) event
+		 * that a later release makes these values smaller.
+		 */
+		StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+						 "u_versionToString output buffer size changed incompatibly");
+		StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+						 "ucol_getVersion output buffer size changed incompatibly");
+	}
+	else
+	{
+		/* This is an older version, so we'll need to use dlopen(). */
+#ifdef HAVE_DLOPEN
+		char		libicui18n_name[MAXPGPATH];
+		char		libicuuc_name[MAXPGPATH];
+
+		/*
+		 * We don't like to open versions newer than what we're linked
+		 * against, to reduce the risk of an API change biting us.
+		 */
+		if (major_version > U_ICU_VERSION_MAJOR_NUM)
+			elog(ERROR, "ICU major version %d higher than linked version %d, refusing to open",
+				 major_version, U_ICU_VERSION_MAJOR_NUM);
+
+		lib->major_version = major_version;
+
+		/*
+		 * XXX using pkglib_path to look for libraries.  The idea is that an
+		 * administrator would drop symlinks in there.
+		 *
+		 * Perhaps should do something like dfmgr.c's dynamic_library_path
+		 * search to find it?
+		 */
+
+		/* Load the collation library. */
+		snprintf(libicui18n_name,
+				 sizeof(libicui18n_name),
+				 "%s/libicui18n" DLSUFFIX ".%d",
+				 pkglib_path,
+				 major_version);
+		lib->libicui18n_handle = load_icu_library(lib, libicui18n_name);
+
+		/* Load the ctype library. */
+		snprintf(libicuuc_name,
+				 sizeof(libicuuc_name),
+				 "%s/libicuuc" DLSUFFIX ".%d",
+				 pkglib_path,
+				 major_version);
+		lib->libicuuc_handle = load_icu_library(lib, libicuuc_name);
+
+		/* Look up all the functions we need. */
+		lib->open = get_icu_function(lib->libicui18n_handle,
+									 "ucol_open",
+									 major_version);
+		lib->close = get_icu_function(lib->libicui18n_handle,
+									  "ucol_close",
+									  major_version);
+		lib->getVersion = get_icu_function(lib->libicui18n_handle,
+										   "ucol_getVersion",
+										   major_version);
+		lib->versionToString = get_icu_function(lib->libicui18n_handle,
+												"u_versionToString",
+												major_version);
+		lib->strcoll = get_icu_function(lib->libicui18n_handle,
+										"ucol_strcoll",
+										major_version);
+		lib->strcollUTF8 = get_icu_function(lib->libicui18n_handle,
+											"ucol_strcollUTF8",
+											major_version);
+		lib->getSortKey = get_icu_function(lib->libicui18n_handle,
+										   "ucol_getSortKey",
+										   major_version);
+		lib->nextSortKeyPart = get_icu_function(lib->libicui18n_handle,
+												"ucol_nextSortKeyPart",
+												major_version);
+		lib->setUTF8 = get_icu_function(lib->libicui18n_handle,
+										"uiter_setUTF8",
+										major_version);
+		lib->errorName = get_icu_function(lib->libicui18n_handle,
+										  "u_errorName",
+										  major_version);
+		lib->strToUpper = get_icu_function(lib->libicuuc_handle,
+										   "u_strToUpper",
+										   major_version);
+		lib->strToLower = get_icu_function(lib->libicuuc_handle,
+										   "u_strToLower",
+										   major_version);
+		lib->strToTitle = get_icu_function(lib->libicuuc_handle,
+										   "u_strToTitle",
+										   major_version);
+		if (!lib->open ||
+			!lib->close ||
+			!lib->getVersion ||
+			!lib->versionToString ||
+			!lib->strcoll ||
+			!lib->strcollUTF8 ||
+			!lib->getSortKey ||
+			!lib->nextSortKeyPart ||
+			!lib->setUTF8 ||
+			!lib->errorName ||
+			!lib->strToUpper ||
+			!lib->strToLower ||
+			!lib->strToTitle)
+		{
+			free_icu_library(lib);
+			ereport(ERROR,
+					(errmsg("could not find expected symbols in library \"%s\"",
+							libicui18n_name)));
+		}
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("dynamically loaded ICU libraries are not supported in this build")));
+#endif
+	}
+
+	lib->next = icu_library_list;
+	icu_library_list = lib;
+
+	return lib;
+}
+
+/*
+ * Look up the library to use for a given collcollate string.
+ */
+static pg_icu_library *
+get_icu_library_for_collation(const char *collcollate, const char **rest)
+{
+	int			major_version;
+	char	   *separator;
+	char	   *after_prefix;
+
+	separator = strchr(collcollate, ':');
+
+	/*
+	 * If it's a traditional value without a prefix, use the library we are
+	 * linked against.
+	 */
+	if (separator == NULL)
+	{
+		*rest = collcollate;
+		return get_icu_library(U_ICU_VERSION_MAJOR_NUM);
+	}
+
+	/* If it has a prefix, interpret it as an ICU major version. */
+	major_version = strtol(collcollate, &after_prefix, 10);
+	if (after_prefix != separator)
+		elog(ERROR,
+			 "could not parse ICU major library version: \"%s\"",
+			 collcollate);
+	if (major_version < PG_MIN_ICU_MAJOR_VERSION ||
+		major_version > PG_MAX_ICU_MAJOR_VERSION)
+		elog(ERROR,
+			 "ICU major library version out of supported range: \"%s\"",
+			 collcollate);
+
+	/* The part after the separator will be passed to the library. */
+	*rest = separator + 1;
+
+	return get_icu_library(major_version);
+}
+
+#endif
+
 void
 make_icu_collator(const char *iculocstr,
 				  struct pg_locale_struct *resultp)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_icu_library_for_collation(iculocstr, &iculocstr);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
+	collator = lib->open(iculocstr, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+						iculocstr, lib->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	/*
+	 * XXX can we just drop this cruft and make 54 the minimum supported
+	 * version?
+	 */
+	if (lib->major_version < 54)
 		icu_set_collation_attributes(collator, iculocstr);
 
 	/* We will leak this string if the caller errors later :-( */
 	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
 	resultp->info.icu.ucol = collator;
+	resultp->info.icu.lib = lib;
 #else							/* not USE_ICU */
 	/* could get here if a collation was created by a build with ICU */
 	ereport(ERROR,
@@ -1688,21 +2017,23 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 #ifdef USE_ICU
 	if (collprovider == COLLPROVIDER_ICU)
 	{
+		pg_icu_library *lib;
 		UCollator  *collator;
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
+		lib = get_icu_library_for_collation(collcollate, &collcollate);
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = lib->open(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, lib->errorName(status))));
+		lib->getVersion(collator, versioninfo);
+		lib->close(collator);
 
-		u_versionToString(versioninfo, buf);
+		lib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -1770,6 +2101,8 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 
 
 #ifdef USE_ICU
+
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -1991,19 +2324,22 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_icu_library_for_collation(icu_locale, &icu_locale);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = lib->open(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, lib->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	/* XXX can we just drop this cruft? */
+	if (lib->major_version < 54)
 		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+	lib->close(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index bbeb0a2653..187f066881 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1025,11 +1025,11 @@ hashbpchar(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, keydata, keylen);
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -1086,11 +1086,11 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 919138eaf3..627fdeafa1 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1666,13 +1666,14 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					UErrorCode	status;
 
 					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
+					result = PG_ICU_LIB(mylocale)->strcollUTF8(PG_ICU_COL(mylocale),
+															   arg1, len1,
+															   arg2, len2,
+															   &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
+								(errmsg("collation failed: %s",
+										PG_ICU_LIB(mylocale)->errorName(status))));
 				}
 				else
 #endif
@@ -1685,9 +1686,9 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
 					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
 
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
+					result = PG_ICU_LIB(mylocale)->strcoll(PG_ICU_COL(mylocale),
+														   uchar1, ulen1,
+														   uchar2, ulen2);
 
 					pfree(uchar1);
 					pfree(uchar2);
@@ -2389,13 +2390,14 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				UErrorCode	status;
 
 				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
+				result = PG_ICU_LIB(sss->locale)->strcollUTF8(PG_ICU_COL(sss->locale),
+															  a1p, len1,
+															  a2p, len2,
+															  &status);
 				if (U_FAILURE(status))
 					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
+							(errmsg("collation failed: %s",
+									PG_ICU_LIB(sss->locale)->errorName(status))));
 			}
 			else
 #endif
@@ -2408,9 +2410,9 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
 				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
 
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
+				result = PG_ICU_LIB(sss->locale)->strcoll(PG_ICU_COL(sss->locale),
+														  uchar1, ulen1,
+														  uchar2, ulen2);
 
 				pfree(uchar1);
 				pfree(uchar2);
@@ -2571,24 +2573,24 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 					uint32_t	state[2];
 					UErrorCode	status;
 
-					uiter_setUTF8(&iter, sss->buf1, len);
+					PG_ICU_LIB(sss->locale)->setUTF8(&iter, sss->buf1, len);
 					state[0] = state[1] = 0;	/* won't need that again */
 					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
+					bsize = PG_ICU_LIB(sss->locale)->nextSortKeyPart(PG_ICU_COL(sss->locale),
+																	 &iter,
+																	 state,
+																	 (uint8_t *) sss->buf2,
+																	 Min(sizeof(Datum), sss->buflen2),
+																	 &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
 								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
+										PG_ICU_LIB(sss->locale)->errorName(status))));
 				}
 				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+					bsize = PG_ICU_LIB(sss->locale)->getSortKey(PG_ICU_COL(sss->locale),
+																uchar, ulen,
+																(uint8_t *) sss->buf2, sss->buflen2);
 			}
 			else
 #endif
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index e7385faef8..ca345d20af 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -17,6 +17,7 @@
 #endif
 #ifdef USE_ICU
 #include <unicode/ucol.h>
+#include <unicode/ubrk.h>
 #endif
 
 #include "utils/guc.h"
@@ -74,6 +75,71 @@ extern struct lconv *PGLC_localeconv(void);
 
 extern void cache_locale_time(void);
 
+#ifdef USE_ICU
+
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	void	   *libicui18n_handle;
+	void	   *libicuuc_handle;
+	UCollator  *(*open) (const char *loc, UErrorCode *status);
+	void		(*close) (UCollator *coll);
+	void		(*getVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+				UCollationResult(*strcoll) (const UCollator *coll,
+											const UChar *source,
+											int32_t sourceLength,
+											const UChar *target,
+											int32_t targetLength);
+				UCollationResult(*strcollUTF8) (const UCollator *coll,
+												const char *source,
+												int32_t sourceLength,
+												const char *target,
+												int32_t targetLength,
+												UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	struct pg_icu_library *next;
+} pg_icu_library;
+
+#endif
 
 /*
  * We define our own wrapper around locale_t so we can keep the same
@@ -95,12 +161,18 @@ struct pg_locale_struct
 		{
 			const char *locale;
 			UCollator  *ucol;
+			pg_icu_library *lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
 	}			info;
 };
 
+#ifdef USE_ICU
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+#define PG_ICU_COL(x) ((x)->info.icu.ucol)
+#endif
+
 typedef struct pg_locale_struct *pg_locale_t;
 
 extern PGDLLIMPORT struct pg_locale_struct default_locale;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4fb746930a..47aba7654b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1100,6 +1100,7 @@ HeapTupleTableSlot
 HistControl
 HotStandbyState
 I32
+ICU_Convert_BI_Func
 ICU_Convert_Func
 ID
 INFIX
@@ -2858,6 +2859,7 @@ TypeName
 U
 U32
 U8
+UBreakIterator
 UChar
 UCharIterator
 UColAttribute
@@ -3485,6 +3487,7 @@ pg_funcptr_t
 pg_gssinfo
 pg_hmac_ctx
 pg_hmac_errno
+pg_icu_library
 pg_int64
 pg_local_to_utf_combined
 pg_locale_t
-- 
2.30.2

#74

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Jeremy Schneider (#11)

Re: Collation version tracking for macOS

Hey Jeremy,

On Tue, Jun 7, 2022 at 12:42 PM Jeremy Schneider
<schneider@ardentperf.com> wrote:

Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of material; I haven’t read the whole thread yet. If you have some others that would also be particularly good background, let me know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I was pulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple years, so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in the conversation here.

There were more threads, but they mostly say the same things, hence my
current attempt to move from bloviation to trying out the ideas with
actual code :-D

Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just warn on version mismatch. Correctness matters to many users, and being able to *trust* string comparisons are correct is pretty damn fundamental all throughout a database. It really doesn’t get any more basic and the potential for bad things to happen is pretty astronomical, if you can’t trust those. I understand the consternation about dealing with upgrades of large & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I have a lot of reading to do, to really understand how that happened and where the dialogue is today.

Given that the only thing you could do about it is REINDEX, and yet we
don't even know which indexes needed to be REINDEXed (the problem
Julien and I tried to address, but so far without success), it seemed
highly premature to convert the warning to an error.

I don't think the community consensus is that we have arrived
somewhere, it's more like we're in transit, possibly without a map.
For example 15 gains ICU support for the default collation (= how most
people consume collations), which changes things, and will surely lead
to more people thinking about this problem space.

Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer), but I still feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good enough” for 99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change. I think glibc needs to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If MySQL wasn’t GPL then I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now though, it feels like my idea is the outlier and the general PG hacker consensus would be to reject this idea. (But maybe I’m wrong?)

Hmm. Well I personally wouldn't try to write a collator any more
willingly than I'd try to write a new cryptographic algorithm, just
not my bag. We don't want to handle complaints about our sort order
(we already bat away complaints about glibc's, and I heard an account
from an OS vendor about the non-stop contradictory crank complaints
about sort order they used to receive before they gave up and adopted
ICU). ICU really is quite authoritative here.

If you mean that you don't even want to have to specify a language
like "en", then note that you don't have to: ICU has a "root" collator
which you can request with an empty string (all other collators apply
cultural tweaks on top of that). Like everything else, the root
collator has changed over time, though.

With my "distinct" experimental patch (v4), you could set your
database default collation to a specific ICU major version's root
collator like so:

create database ... locale_provider = icu icu_locale = '71:' template
= template0

That'll keep working, even across pg_upgrades to some version of
PostgreSQL far in the future that is linked against ICU 100, by
dlopen'ing the .71 libraries, for as long as you can get your hands on
a libicu71 package or otherwise build your own, and it'll error out if
it can't open that library, which may be the hard error you were
looking for. If there's an API change in ICU we'll have to make some
changes, but that's already true.
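
To make the prefix convention concrete, here is a minimal standalone sketch of how a value such as '71:' or '71:en-US' might be split into a major version and the locale string handed to ICU. It follows the idea of get_icu_library_for_collation() in the patch but is not the patch's code: split_icu_locale(), its return conventions, and the 1..999 sanity bound are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: split "71:en-US" into major version 71 and "en-US".
 *
 * Returns 0 if there is no prefix (use the linked ICU), the major version if
 * there is one, and -1 if the prefix is not a positive decimal number.
 * On success *rest points into the input string.
 */
static int
split_icu_locale(const char *spec, const char **rest)
{
	const char *separator;
	char	   *end;
	long		major;

	assert(spec != NULL && rest != NULL);

	separator = strchr(spec, ':');
	if (separator == NULL)
	{
		*rest = spec;			/* traditional value, e.g. "en-US" */
		return 0;
	}

	errno = 0;
	major = strtol(spec, &end, 10);
	if (errno != 0 || end == spec || end != separator ||
		major <= 0 || major > 999)	/* 999 is an arbitrary sanity limit */
		return -1;

	*rest = separator + 1;		/* may be "", i.e. the root collator */
	return (int) major;
}
```

With this sketch, "71:" yields version 71 and an empty locale (the root collator), while a plain "en-US" yields 0, meaning "whatever the binary is linked against".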

Review/testing/flames/rants/better ideas welcome.

Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for the non-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me <g>).

#75

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Thomas Munro (#43)

Re: Collation version tracking for macOS

On Thu, Jun 9, 2022 at 11:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Jun 9, 2022 at 5:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm sure that Apple are indeed updating the UTF8 data behind
their proprietary i18n APIs, but the libc APIs are mostly getting benign
neglect.

As for how exactly they might be doing that, I don't know, but a bit
of light googling tells me that a private, headerless,
please-don't-call-me-directly copy of ICU arrived back in macOS
10.3[1]. I don't see it on my 12.4 system, but I also know that 12.x
started hiding system libraries completely (the linker is magic and
pulls libraries from some parallel dimension, there is no
/usr/lib/libSystem.B.dylib file on disk, and yet otool -L
<your_favourite_executable> references it).

The other thread about a macOS linking problem nerd-sniped me back
into here to find out how to see breadcrumbs between hidden libraries
on this super weird UNIX™ and confirm that they are indeed still
shipping a private ICU for use by their Core Foundation stuff that's
used by fancy ObjC/Swift/... etc GUI apps. The following command was
an interesting discovery for me because otool -L can't see any of the
new kind of ghost libraries:

% dyld_info -dependents /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation

Though I can't get my hands on the hidden ICU library itself to
disassemble (without installing weird extra tools, apparently [1]),
that at least revealed its name, which I could then dlopen out of
curiosity. It seems they jammed all the ICU sub-libraries into one,
and configured it with --disable-renaming so it doesn't have major
version suffixes on symbol names.
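
For context on what --disable-renaming changes: a default ICU build appends the major version to each exported C symbol (ucol_open becomes ucol_open_71 in ICU 71), so code that looks functions up with dlsym() has to construct the suffixed name, whereas a build without renaming exports the plain name. A small sketch of that name construction follows. icu_symbol_name() is a hypothetical helper, not something taken from the patch or from ICU, and using 0 to mean "no suffix" is just a convention of this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the symbol name to pass to dlsym().
 *
 * major_version > 0:  default ICU build, symbols carry a "_NN" suffix.
 * major_version == 0: library built with --disable-renaming, plain names.
 *
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
static int
icu_symbol_name(char *buf, size_t size, const char *function,
				int major_version)
{
	int			n;

	assert(buf != NULL && function != NULL);
	assert(strlen(function) > 0 && major_version >= 0);

	if (major_version > 0)
		n = snprintf(buf, size, "%s_%d", function, major_version);
	else
		n = snprintf(buf, size, "%s", function);

	return (n >= 0 && (size_t) n < size) ? 0 : -1;
}
```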

It'd clearly be a terrible idea for us to try to use any of that, and
Mac users should be very happy with the new support for ICU as DB
default.

[1]: https://lapcatsoftware.com/articles/bigsur.html

#76

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Thomas Munro (#75)

Re: Collation version tracking for macOS

On Mon, Jun 13, 2022 at 5:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:

It'd clearly be a terrible idea for us to try to use any of that, and
Mac users should be very happy with the new support for ICU as DB
default.

This suggests something that I already suspected: nobody particularly
expects the system lib C to be authoritative for the OS as a whole, in
the way that Postgres supposes. At least in the case of Mac OS, which
is after all purely a desktop operating system.

--
Peter Geoghegan

#77

Peter Eisentraut

peter.eisentraut@enterprisedb.com

over 3 years ago

In reply to: Peter Geoghegan (#67)

Re: Collation version tracking for macOS

On 11.06.22 05:35, Peter Geoghegan wrote:

Do we even need to store a version for indexes most of the time if
we're versioning ICU itself, as part of the "time travelling
collations" design? For that matter, do we even need to version
collations directly anymore?

Conversely, why are we looking at the ICU version instead of the
collation version?  If we have recorded the collation as being version
1234, we need to look through the available ICU versions (assuming we
can load multiple ones somehow) and pick the one that provides 1234. It
doesn't matter whether it's the same ICU version that the collation was
originally created with, as long as the collation version stays the same.
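
The search described here could look roughly like the sketch below. It is only an illustration under stated assumptions: the available_icu struct and find_icu_by_collversion() are hypothetical names, and the table would in practice have to be filled by opening each installed library and asking it for the collation's version, which this example leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: what one loadable ICU library reports for a locale. */
typedef struct
{
	int			major_version;	/* e.g. 67 */
	const char *collversion;	/* version string reported for the collation */
} available_icu;

/*
 * Return the major version of the first library whose reported collation
 * version equals the recorded one, or -1 if no library provides it.
 */
static int
find_icu_by_collversion(const available_icu *libs, size_t nlibs,
						const char *recorded)
{
	assert(libs != NULL || nlibs == 0);
	assert(recorded != NULL);

	for (size_t i = 0; i < nlibs; i++)
	{
		if (strcmp(libs[i].collversion, recorded) == 0)
			return libs[i].major_version;
	}
	return -1;
}
```

Note that two different major versions may report the same collation version, in which case either would do; the sketch simply takes the first match.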

#78

Jeremy Schneider

schneider@ardentperf.com

over 3 years ago

In reply to: Peter Eisentraut (#77)

Re: Collation version tracking for macOS

On Jun 14, 2022, at 14:10, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

Conversely, why are we looking at the ICU version instead of the collation version? If we have recorded the collation as being version 1234, we need to look through the available ICU versions (assuming we can load multiple ones somehow) and pick the one that provides 1234. It doesn't matter whether it's the same ICU version that the collation was originally created with, as long as the collation version stays the same.

Does Unicode CLDR provide (or even track) versioning of collation or other i18n functionality for individual locale settings? I’m thinking it might not even have that concept in the original source repo/data, but I might be remembering wrong.

It would require not only watching for changes in the per-locale tailoring rules but also being cognizant of changes in root/DUCET behavior and understanding the impact of changes there.

(Common mistake I’ve seen folks make when comparing OS glibc versions is only looking at locale data, not realizing there have been changes to root behavior that didn’t involve any changes to locale data files)

-Jeremy

#79

Thomas Munro

thomas.munro@gmail.com

over 3 years ago

In reply to: Jeremy Schneider (#78)

Re: Collation version tracking for macOS

On Wed, Jun 15, 2022 at 7:10 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

On Jun 14, 2022, at 14:10, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
Conversely, why are we looking at the ICU version instead of the collation version? If we have recorded the collation as being version 1234, we need to look through the available ICU versions (assuming we can load multiple ones somehow) and pick the one that provides 1234. It doesn't matter whether it's the same ICU version that the collation was originally created with, as long as the collation version stays the same.

One difference would be the effect if ICU ever ships a minor library
version update that changes the reported collversion.

1. With the code I proposed in my v4 patch, our version mismatch
warnings would kick in, but otherwise everything would continue to
work (and corrupt indexes, if they really moved anything around).
2. With a system that (somehow) opens all available libraries and
looks for a match, it would fail to find one. That is assuming that you
are using the typical major-versioned packages we can see in software
distributions like Debian.

I don't know if minor version changes actually do that, though I have
wondered out loud a few times in these threads. I might go and poke
at some ancient packages to see if that's happened before. To defend
against that, we could instead do major + minor versioning, but so far
I worried about major only because that's the way they ship 'em in
Debian and (AFAICS) RHEL etc, so you can't easily install 68.0 and
68.1 at the same time. On the other hand, you could always "pin" (or
similar concepts) the libicu68 package to a specific minor release, to
fix the problem (whether you failed like 1 or like 2 above).

(Common mistake I’ve seen folks make when comparing OS glibc versions is only looking at locale data, not realizing there have been changes to root behavior that didn’t involve any changes to locale data files)

Yeah, I've wondered idly before if libc projects and ICU couldn't just
offer a way to ask for versions explicitly, and ship historical data.
With some system of symlinks to make it all work with defaults for
those who don't care, a libc could have
/usr/share/locale/en_US@CLDR34.UTF-8 etc so you could
setlocale(LC_COLLATE, "en_US@CLDR34"), or something. I suppose they
don't want to promise to be able to interpret the old data in future
releases, and, as you say, sometimes the changes are in C code, due to
bugs or algorithm changes, not the data.
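
From the application side, the versioned-name idea might be consumed along the lines of the sketch below: ask for the pinned name first and fall back to something else if the C library does not know it. This is purely illustrative. "en_US@CLDR34.UTF-8" is the made-up name from the paragraph above, not something any libc is claimed here to ship, and set_collate_locale() is a hypothetical helper.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: try each candidate name in order for LC_COLLATE.
 * Returns the first name the C library accepted, or NULL if none was.
 */
static const char *
set_collate_locale(const char *const *candidates, size_t n)
{
	assert(candidates != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		/* setlocale() returns NULL when the request cannot be honored */
		if (setlocale(LC_COLLATE, candidates[i]) != NULL)
			return candidates[i];
	}
	return NULL;
}
```

A caller that really cares about stability would treat a fallback as an error rather than silently accepting a different collation; which name is accepted depends entirely on the platform.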

#80

Jeremy Schneider

schneider@ardentperf.com

over 3 years ago

In reply to: Thomas Munro (#79)

Re: Collation version tracking for macOS

On Jun 14, 2022, at 19:06, Thomas Munro <thomas.munro@gmail.com> wrote:
One difference would be the effect if ICU ever ships a minor library
version update that changes the reported collversion.

If I’m reading it correctly, ICU would not change collation in major versions, as an explicit matter of policy around DUCET stability and versioning.

https://unicode.org/reports/tr10/#Stable_DUCET

With some system of symlinks to make it all work with defaults for
those who don't care, a libc could have
/usr/share/locale/en_US@CLDR34.UTF-8 etc so you could
setlocale(LC_COLLATE, "en_US@CLDR34"), or something. I suppose they
don't want to promise to be able to interpret the old data in future
releases, and, as you say, sometimes the changes are in C code, due to
bugs or algorithm changes, not the data.

If I understand correctly, files in /usr/share/locale aren’t enough because those only have the tailoring rules, and core algorithm and data (before applying locale-specific tweaks) also change between versions. I’m pretty sure glibc works similarly to UCA in this regard (albeit based on ISO 14651 and not CLDR), and the Unicode link above is a good illustration of default collation rules that underlie the locale-specific tweaks.

-Jeremy

Sent from my TI-83

#81

Peter Eisentraut

peter.eisentraut@enterprisedb.com

over 3 years ago

In reply to: Jeremy Schneider (#78)

Re: Collation version tracking for macOS

On 14.06.22 21:10, Jeremy Schneider wrote:

Does Unicode CLDR provide (or even track) versioning of collation or other i18n functionality for individual locale settings?

Yes. You can see that in PostgreSQL as various pre-seeded ICU
collations having different versions.

#82

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Peter Eisentraut (#81)

1 attachment(s)

Re: Collation version tracking for macOS

Hi,

Here is a rebase of this experimental patch. I think the basic
mechanics are promising, but we haven't agreed on a UX. I hope we can
figure this out.

Restating the choice made in this branch of the experiment: Here I
try to be just like DB2 (if I understood its manual correctly).
In DB2, you can use names like "en_US" if you don't care about
changes, and names like "CLDR181_en_US" if you do. It's the user's
choice to use the second kind to avoid "unexpected effects on
applications or database objects" after upgrades. Translated to
PostgreSQL concepts, you can use a database default ICU locale like
"en-US" if you don't care and "67:en-US" if you do, and for COLLATION
objects it's the same. The convention I tried in this patch is that
you use either "en-US-x-icu" (which points to "en-US") or
"en-US-x-icu67" (which points to "67:en-US") depending on whether you
care about this problem.
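
The naming convention can be sketched as a pair of format strings. The helper below is hypothetical (icu_collation_names() is not a function in the patch) and only shows how a collation object name and the ICU locale string it points to would be derived from a base locale and an optional pinned major version.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper illustrating the naming convention.
 *
 * major_version == 0: unpinned pair, e.g. "en-US-x-icu"   -> "en-US"
 * major_version  > 0: pinned pair,   e.g. "en-US-x-icu67" -> "67:en-US"
 *
 * Returns 0 on success, -1 if either output buffer is too small.
 */
static int
icu_collation_names(const char *locale, int major_version,
					char *collname, size_t collname_size,
					char *iculocale, size_t iculocale_size)
{
	int			n1;
	int			n2;

	assert(locale != NULL && collname != NULL && iculocale != NULL);
	assert(major_version >= 0);
	assert(strchr(locale, ':') == NULL);	/* must not already be prefixed */

	if (major_version > 0)
	{
		n1 = snprintf(collname, collname_size, "%s-x-icu%d",
					  locale, major_version);
		n2 = snprintf(iculocale, iculocale_size, "%d:%s",
					  major_version, locale);
	}
	else
	{
		n1 = snprintf(collname, collname_size, "%s-x-icu", locale);
		n2 = snprintf(iculocale, iculocale_size, "%s", locale);
	}

	return (n1 >= 0 && (size_t) n1 < collname_size &&
			n2 >= 0 && (size_t) n2 < iculocale_size) ? 0 : -1;
}
```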

I recognise that this is a bit cheesy: it's all the user's problem to
deal with or ignore.

An alternative mentioned by Peter E was that the locale names
shouldn't carry the prefix, but somehow we should have a list of ICU
versions to search for a matching datcollversion/collversion. How
would that look? Perhaps a GUC, icu_library_versions = '63, 67, 71'?
There is currently a natural and smallish range of supported versions,
probably something like 54 ... U_ICU_VERSION_MAJOR_NUM, but it seems a
bit weird to try to dlopen ~25 libraries or whatever it might be...
Do you think we should try to code this up?
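
To make the search-by-collversion alternative concrete, here is a rough sketch of the lookup it implies: walk a configured list of ICU major versions and pick the first one whose reported collation version matches the stored one. Everything here is hypothetical: the function names, the callback type, and the demo version strings are invented for illustration, and a real implementation would ask each loaded library for its version instead of using a stub.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical callback: returns the collation version string that the given
 * ICU major version reports for a locale, or NULL if that library is not
 * available.
 */
typedef const char *(*collversion_fn) (int icu_major, const char *locale);

/*
 * Return the first major version in majors[] whose collation version for
 * locale equals wanted, or -1 if none matches.
 */
static int
find_icu_major_for_collversion(const int *majors, size_t nmajors,
							   const char *locale, const char *wanted,
							   collversion_fn lookup)
{
	assert(majors != NULL || nmajors == 0);

	for (size_t i = 0; i < nmajors; i++)
	{
		const char *actual = lookup(majors[i], locale);

		if (actual != NULL && strcmp(actual, wanted) == 0)
			return majors[i];
	}
	return -1;
}

/* Stub with made-up version strings, for demonstration only. */
static const char *
demo_lookup(int icu_major, const char *locale)
{
	(void) locale;
	switch (icu_major)
	{
		case 63:
			return "demo-A";
		case 67:
			return "demo-B";
		default:
			return NULL;		/* pretend this version is not installed */
	}
}
```

The list passed in would come from something like the proposed icu_library_versions setting; the sketch leaves out how that setting is parsed.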

I haven't tried it, but the main usability problem I predict with that
idea is this: It can cope with a scenario where you created a
database with ICU 63 and started using a default of "en" and maybe
some explicit fr-x-icu or whatever, and then you upgrade to a new
postgres binary using ICU 71, and, as long as you still have ICU 63
installed it'll just magically keep using 63, now via dlopen(). But it
doesn't provide a way for me to create a new database that uses 63 on
purpose when I know what I'm doing. There are various reasons I might
want to do that.

Maybe the ideas could be combined? Perhaps "en" means "create using
binary's linked ICU, open using search-by-collversion", while "67:en"
explicitly says which to use?

Changes since last version:

* Now it just uses the default dlopen() search path, unless you set
icu_library_path. Is that a security problem? It's pretty
convenient, because it means you can just "apt-get install libicu63"
(or local equivalent) and that's all, now 63 is available.

* To try the idea out, I made it automatically create "*-x-icu67"
alongside the regular "*-x-icu" collation objects at initdb time.
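
The library lookup described in the first bullet can be sketched in isolation as follows. This only shows how a versioned file name and a versioned symbol name might be assembled before being handed to dlopen() and dlsym(); the helper names are invented for this example, and the ".so" suffix is a Linux-style assumption (the patch itself uses DLSUFFIX and a different pattern on Windows).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build e.g. "libicui18n.so.63", or "/opt/icu/libicui18n.so.63" when dir is
 * non-empty.  An empty dir leaves resolution to the dynamic linker's default
 * search path.  Returns 0 on success, -1 if the name did not fit.
 */
static int
build_icu_library_name(char *buf, size_t buflen,
					   const char *dir, const char *stem, int major)
{
	int			n;

	assert(buf != NULL && dir != NULL && stem != NULL);

	n = snprintf(buf, buflen, "%s%slib%s.so.%d",
				 dir, dir[0] ? "/" : "", stem, major);
	return (n >= 0 && (size_t) n < buflen) ? 0 : -1;
}

/*
 * Build a version-suffixed symbol name such as "ucol_open_63", which is what
 * ICU exports when it is built with symbol renaming enabled.
 */
static int
build_icu_symbol_name(char *buf, size_t buflen,
					  const char *function, int major)
{
	int			n;

	assert(buf != NULL && function != NULL);

	n = snprintf(buf, buflen, "%s_%d", function, major);
	return (n >= 0 && (size_t) n < buflen) ? 0 : -1;
}
```

A caller would pass the first result to dlopen() and the second to dlsym() on the returned handle, checking both for NULL.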

Attachments:

v5-0001-WIP-Multi-version-ICU.patch (application/x-patch)

From d3e83d0aa5cbb3eb192a2f66d68623cd3b1595b4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v5] WIP: Multi-version ICU.

Add a layer of indirection when accessing ICU, so that multiple major
versions of the library can be used at once.  Versions other than the
one that PostgreSQL was linked against are opened with dlopen(), but we
refuse to open versions higher than the one we were compiled against.
The ABI might change in future releases so that wouldn't be safe.

By default, the system linker's default search path is used to find
libraries, but icu_library_path may be used to specify an absolute path
to look in.  ICU libraries are expected to have been built without ICU's
--disable-renaming option.  That is, major versions must use distinct
symbol names.

This arrangement means that at least one major version of ICU is always
available -- the one that PostgreSQL was linked against.  It should be
simple on most software distributions to install extra versions using a
package manager, or to build extra libraries as required, to access
older ICU releases.  For example, on Debian bullseye the packages are
named libicu63, libicu67, libicu71.

In this version of the patch, '63:en' used as a database default locale
or COLLATION object requests ICU library 63, and 'en' requests the
library that is linked against the postgres executable.

XXX Many other designs possible, to discuss!

Discussion: https://postgr.es/m/CA%2BhUKGL4VZRpP3CkjYQkv4RQ6pRYkPkSNgKSxFBwciECQ0mEuQ%40mail.gmail.com
---
 src/backend/access/hash/hashfunc.c   |  16 +-
 src/backend/commands/collationcmds.c |  20 ++
 src/backend/utils/adt/formatting.c   |  53 +++-
 src/backend/utils/adt/pg_locale.c    | 364 ++++++++++++++++++++++++++-
 src/backend/utils/adt/varchar.c      |  16 +-
 src/backend/utils/adt/varlena.c      |  56 +++--
 src/backend/utils/misc/guc_tables.c  |  14 ++
 src/include/utils/pg_locale.h        |  73 ++++++
 src/tools/pgindent/typedefs.list     |   3 +
 9 files changed, 549 insertions(+), 66 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..0a61538efd 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,11 @@ hashtext(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -355,11 +355,11 @@ hashtextextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index fcfc02d2ae..26e747d9d7 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -812,6 +812,26 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 					CreateComments(collid, CollationRelationId, 0,
 								   icucomment);
 			}
+
+			/* Also create an object pinned to an ICU major version. */
+			collid = CollationCreate(psprintf("%s-x-icu-%d", langtag, U_ICU_VERSION_MAJOR_NUM),
+									 nspid, GetUserId(),
+									 COLLPROVIDER_ICU, true, -1,
+									 NULL, NULL,
+									 psprintf("%d:%s", U_ICU_VERSION_MAJOR_NUM, iculocstr),
+									 get_collation_actual_version(COLLPROVIDER_ICU, iculocstr),
+									 true, true);
+			if (OidIsValid(collid))
+			{
+				ncreated++;
+
+				CommandCounterIncrement();
+
+				icucomment = get_icu_locale_comment(name);
+				if (icucomment)
+					CreateComments(collid, CollationRelationId, 0,
+								   icucomment);
+			}
 		}
 	}
 #endif							/* USE_ICU */
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 26f498b5df..0c3c7724d7 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1599,6 +1599,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1623,18 +1628,41 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						PG_ICU_LIB(mylocale)->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->info.icu.locale, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->info.icu.locale, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						PG_ICU_LIB(mylocale)->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1702,7 +1730,8 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToLower,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1824,7 +1853,8 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToUpper,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1947,8 +1977,9 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
+			len_conv = icu_convert_case_bi(PG_ICU_LIB(mylocale)->strToTitle,
+										   mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..bf76516406 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -58,6 +58,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_control.h"
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/formatting.h"
 #include "utils/guc_hooks.h"
@@ -69,6 +70,7 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -79,14 +81,31 @@
 #include <shlwapi.h>
 #endif
 
+#include <dlfcn.h>
+
 #define		MAX_L10N_DATA		80
 
+#ifdef USE_ICU
+
+/*
+ * We don't want to call into dlopen'd ICU libraries that are newer than the
+ * one we were compiled and linked against, just in case there is an
+ * incompatible API change.
+ */
+#define PG_MAX_ICU_MAJOR_VERSION U_ICU_VERSION_MAJOR_NUM
+
+/* An old ICU release that we know has the right API. */
+#define PG_MIN_ICU_MAJOR_VERSION 54
+
+#endif
+
 
 /* GUC settings */
 char	   *locale_messages;
 char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
+char	   *icu_library_path;
 
 /*
  * lc_time localization cache.
@@ -1398,29 +1417,343 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
+#ifdef USE_ICU
+
 struct pg_locale_struct default_locale;
 
+/* Linked list of ICU libraries we have loaded. */
+static pg_icu_library *icu_library_list = NULL;
+
+/*
+ * Free an ICU library.  pg_icu_library objects that are successfully
+ * constructed stick around for the lifetime of the backend, but this is used
+ * to clean up if initialization fails.
+ */
+static void
+free_icu_library(pg_icu_library *lib)
+{
+	if (lib->libicui18n_handle)
+		dlclose(lib->libicui18n_handle);
+	if (lib->libicuuc_handle)
+		dlclose(lib->libicuuc_handle);
+	pfree(lib);
+}
+
+static void *
+get_icu_function(void *handle, const char *function, int version)
+{
+	char		name[80];
+
+	snprintf(name, sizeof(name), "%s_%d", function, version);
+
+	return dlsym(handle, name);
+}
+
+/*
+ * Probe a dynamically loaded library to see which major version of ICU it
+ * contains.
+ */
+static int
+get_icu_library_major_version(void *handle)
+{
+	for (int i = PG_MIN_ICU_MAJOR_VERSION; i <= PG_MAX_ICU_MAJOR_VERSION; ++i)
+		if (get_icu_function(handle, "ucol_open", i) ||
+			get_icu_function(handle, "u_strToUpper", i))
+			return i;
+
+	/*
+	 * It's a later version we don't dare use, an old version we don't
+	 * support, an ICU build with symbol suffixes disabled, or not ICU.
+	 */
+	return -1;
+}
+
+/*
+ * We have to load a couple of different libraries, so we'll reuse the code to
+ * do that.
+ */
+static void *
+load_icu_library(pg_icu_library *lib, const char *name)
+{
+	void	   *handle;
+	int			found_major_version;
+
+	handle = dlopen(name, RTLD_NOW | RTLD_GLOBAL);
+	if (handle == NULL)
+	{
+		int			errno_save = errno;
+
+		free_icu_library(lib);
+		errno = errno_save;
+
+		ereport(ERROR,
+				(errmsg("could not load library \"%s\": %m", name)));
+	}
+
+	found_major_version = get_icu_library_major_version(handle);
+	if (found_major_version < 0)
+	{
+		free_icu_library(lib);
+		ereport(ERROR,
+				(errmsg("could not find compatible ICU major version in library \"%s\"",
+						name)));
+	}
+
+	if (found_major_version != lib->major_version)
+	{
+		free_icu_library(lib);
+		ereport(ERROR,
+				(errmsg("expected to find ICU major version %d in library \"%s\", but found %d",
+						lib->major_version, name, found_major_version)));
+	}
+
+	return handle;
+}
+
+/*
+ * Given an ICU major version number, return the object we need to access it,
+ * or fail while trying to load it.
+ */
+static pg_icu_library *
+get_icu_library(int major_version)
+{
+	pg_icu_library *lib;
+
+	Assert(major_version >= PG_MIN_ICU_MAJOR_VERSION &&
+		   major_version <= PG_MAX_ICU_MAJOR_VERSION);
+
+	/* Try to find it in our list of existing libraries. */
+	for (lib = icu_library_list; lib; lib = lib->next)
+		if (lib->major_version == major_version)
+			return lib;
+
+	/* Make a new entry. */
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(*lib));
+	if (major_version == U_ICU_VERSION_MAJOR_NUM)
+	{
+		/*
+		 * This is the version we were compiled and linked against.  Simply
+		 * assign the function pointers.
+		 *
+		 * These assignments will fail to compile if an incompatible API
+		 * change is made to some future version of ICU, at which point we
+		 * might need to consider special treatment for different major
+		 * version ranges, with intermediate trampoline functions.
+		 */
+		lib->major_version = major_version;
+		lib->open = ucol_open;
+		lib->close = ucol_close;
+		lib->getVersion = ucol_getVersion;
+		lib->versionToString = u_versionToString;
+		lib->strcoll = ucol_strcoll;
+		lib->strcollUTF8 = ucol_strcollUTF8;
+		lib->getSortKey = ucol_getSortKey;
+		lib->nextSortKeyPart = ucol_nextSortKeyPart;
+		lib->setUTF8 = uiter_setUTF8;
+		lib->errorName = u_errorName;
+		lib->strToUpper = u_strToUpper;
+		lib->strToLower = u_strToLower;
+		lib->strToTitle = u_strToTitle;
+
+		/*
+		 * Also assert the size of a couple of types used as output buffers,
+		 * as a canary to tell us to add extra padding in the (unlikely) event
+		 * that a later release makes these values smaller.
+		 */
+		StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+						 "u_versionToString output buffer size changed incompatibly");
+		StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+						 "ucol_getVersion output buffer size changed incompatibly");
+	}
+	else
+	{
+		/* This is an older version, so we'll need to use dlopen(). */
+		char		libicui18n_name[MAXPGPATH];
+		char		libicuuc_name[MAXPGPATH];
+
+		/*
+		 * We don't like to open versions newer than what we're linked
+		 * against, to reduce the risk of an API change biting us.
+		 */
+		if (major_version > U_ICU_VERSION_MAJOR_NUM)
+			elog(ERROR, "ICU major version %d higher than linked version %d, refusing to open",
+				 major_version, U_ICU_VERSION_MAJOR_NUM);
+
+		lib->major_version = major_version;
+
+		/*
+		 * See
+		 * https://unicode-org.github.io/icu/userguide/icu4c/packaging.html#icu-versions
+		 * for conventions on library naming on POSIX and Windows systems.
+		 */
+
+		/* Load the collation library. */
+		snprintf(libicui18n_name,
+				 sizeof(libicui18n_name),
+#ifdef WIN32
+				 "%s%sicui18n%d." DLSUFFIX,
+				 icu_library_path,
+				 icu_library_path[0] ? "\\" : "",
+#else
+				 "%s%slibicui18n" DLSUFFIX ".%d",
+				 icu_library_path,
+				 icu_library_path[0] ? "/" : "",
+#endif
+				 major_version);
+		lib->libicui18n_handle = load_icu_library(lib, libicui18n_name);
+
+		/* Load the ctype library. */
+		snprintf(libicuuc_name,
+				 sizeof(libicuuc_name),
+#ifdef WIN32
+				 "%s%sicuuc%d." DLSUFFIX,
+				 icu_library_path,
+				 icu_library_path[0] ? "\\" : "",
+#else
+				 "%s%slibicuuc" DLSUFFIX ".%d",
+				 icu_library_path,
+				 icu_library_path[0] ? "/" : "",
+#endif
+				 major_version);
+		lib->libicuuc_handle = load_icu_library(lib, libicuuc_name);
+
+		/* Look up all the functions we need. */
+		lib->open = get_icu_function(lib->libicui18n_handle,
+									 "ucol_open",
+									 major_version);
+		lib->close = get_icu_function(lib->libicui18n_handle,
+									  "ucol_close",
+									  major_version);
+		lib->getVersion = get_icu_function(lib->libicui18n_handle,
+										   "ucol_getVersion",
+										   major_version);
+		lib->versionToString = get_icu_function(lib->libicui18n_handle,
+												"u_versionToString",
+												major_version);
+		lib->strcoll = get_icu_function(lib->libicui18n_handle,
+										"ucol_strcoll",
+										major_version);
+		lib->strcollUTF8 = get_icu_function(lib->libicui18n_handle,
+											"ucol_strcollUTF8",
+											major_version);
+		lib->getSortKey = get_icu_function(lib->libicui18n_handle,
+										   "ucol_getSortKey",
+										   major_version);
+		lib->nextSortKeyPart = get_icu_function(lib->libicui18n_handle,
+												"ucol_nextSortKeyPart",
+												major_version);
+		lib->setUTF8 = get_icu_function(lib->libicui18n_handle,
+										"uiter_setUTF8",
+										major_version);
+		lib->errorName = get_icu_function(lib->libicui18n_handle,
+										  "u_errorName",
+										  major_version);
+		lib->strToUpper = get_icu_function(lib->libicuuc_handle,
+										   "u_strToUpper",
+										   major_version);
+		lib->strToLower = get_icu_function(lib->libicuuc_handle,
+										   "u_strToLower",
+										   major_version);
+		lib->strToTitle = get_icu_function(lib->libicuuc_handle,
+										   "u_strToTitle",
+										   major_version);
+		if (!lib->open ||
+			!lib->close ||
+			!lib->getVersion ||
+			!lib->versionToString ||
+			!lib->strcoll ||
+			!lib->strcollUTF8 ||
+			!lib->getSortKey ||
+			!lib->nextSortKeyPart ||
+			!lib->setUTF8 ||
+			!lib->errorName ||
+			!lib->strToUpper ||
+			!lib->strToLower ||
+			!lib->strToTitle)
+		{
+			free_icu_library(lib);
+			ereport(ERROR,
+					(errmsg("could not find expected symbols in library \"%s\"",
+							libicui18n_name)));
+		}
+	}
+
+	lib->next = icu_library_list;
+	icu_library_list = lib;
+
+	return lib;
+}
+
+/*
+ * Look up the library to use for a given collcollate string.
+ */
+static pg_icu_library *
+get_icu_library_for_collation(const char *collcollate, const char **rest)
+{
+	int			major_version;
+	char	   *separator;
+	char	   *after_prefix;
+
+	separator = strchr(collcollate, ':');
+
+	/*
+	 * If it's a traditional value without a prefix, use the library we are
+	 * linked against.
+	 */
+	if (separator == NULL)
+	{
+		*rest = collcollate;
+		return get_icu_library(U_ICU_VERSION_MAJOR_NUM);
+	}
+
+	/* If it has a prefix, interpret it as an ICU major version. */
+	major_version = strtol(collcollate, &after_prefix, 10);
+	if (after_prefix != separator)
+		elog(ERROR,
+			 "could not parse ICU major library version: \"%s\"",
+			 collcollate);
+	if (major_version < PG_MIN_ICU_MAJOR_VERSION ||
+		major_version > PG_MAX_ICU_MAJOR_VERSION)
+		elog(ERROR,
+			 "ICU major library version out of supported range: \"%s\"",
+			 collcollate);
+
+	/* The part after the separator will be passed to the library. */
+	*rest = separator + 1;
+
+	return get_icu_library(major_version);
+}
+
+#endif
+
 void
 make_icu_collator(const char *iculocstr,
 				  struct pg_locale_struct *resultp)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_icu_library_for_collation(iculocstr, &iculocstr);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
+	collator = lib->open(iculocstr, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+						iculocstr, lib->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	/*
+	 * XXX can we just drop this cruft and make 54 the minimum supported
+	 * version?
+	 */
+	if (lib->major_version < 54)
 		icu_set_collation_attributes(collator, iculocstr);
 
 	/* We will leak this string if the caller errors later :-( */
 	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
 	resultp->info.icu.ucol = collator;
+	resultp->info.icu.lib = lib;
 #else							/* not USE_ICU */
 	/* could get here if a collation was created by a build with ICU */
 	ereport(ERROR,
@@ -1651,21 +1984,23 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 #ifdef USE_ICU
 	if (collprovider == COLLPROVIDER_ICU)
 	{
+		pg_icu_library *lib;
 		UCollator  *collator;
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
+		lib = get_icu_library_for_collation(collcollate, &collcollate);
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = lib->open(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, lib->errorName(status))));
+		lib->getVersion(collator, versioninfo);
+		lib->close(collator);
 
-		u_versionToString(versioninfo, buf);
+		lib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -1733,6 +2068,8 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 
 
 #ifdef USE_ICU
+
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -1954,19 +2291,22 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_icu_library_for_collation(icu_locale, &icu_locale);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = lib->open(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, lib->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	/* XXX can we just drop this cruft? */
+	if (lib->major_version < 54)
 		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+	lib->close(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 68e2e6f7a7..e0c86870e0 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1026,11 +1026,11 @@ hashbpchar(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, keydata, keylen);
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -1087,11 +1087,11 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c5e7ee7ca2..cf891a5654 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1667,13 +1667,14 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					UErrorCode	status;
 
 					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
+					result = PG_ICU_LIB(mylocale)->strcollUTF8(PG_ICU_COL(mylocale),
+															   arg1, len1,
+															   arg2, len2,
+															   &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
+								(errmsg("collation failed: %s",
+										PG_ICU_LIB(mylocale)->errorName(status))));
 				}
 				else
 #endif
@@ -1686,9 +1687,9 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
 					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
 
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
+					result = PG_ICU_LIB(mylocale)->strcoll(PG_ICU_COL(mylocale),
+														   uchar1, ulen1,
+														   uchar2, ulen2);
 
 					pfree(uchar1);
 					pfree(uchar2);
@@ -2388,13 +2389,14 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				UErrorCode	status;
 
 				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
+				result = PG_ICU_LIB(sss->locale)->strcollUTF8(PG_ICU_COL(sss->locale),
+															  a1p, len1,
+															  a2p, len2,
+															  &status);
 				if (U_FAILURE(status))
 					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
+							(errmsg("collation failed: %s",
+									PG_ICU_LIB(sss->locale)->errorName(status))));
 			}
 			else
 #endif
@@ -2407,9 +2409,9 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
 				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
 
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
+				result = PG_ICU_LIB(sss->locale)->strcoll(PG_ICU_COL(sss->locale),
+														  uchar1, ulen1,
+														  uchar2, ulen2);
 
 				pfree(uchar1);
 				pfree(uchar2);
@@ -2569,24 +2571,24 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 					uint32_t	state[2];
 					UErrorCode	status;
 
-					uiter_setUTF8(&iter, sss->buf1, len);
+					PG_ICU_LIB(sss->locale)->setUTF8(&iter, sss->buf1, len);
 					state[0] = state[1] = 0;	/* won't need that again */
 					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
+					bsize = PG_ICU_LIB(sss->locale)->nextSortKeyPart(PG_ICU_COL(sss->locale),
+																	 &iter,
+																	 state,
+																	 (uint8_t *) sss->buf2,
+																	 Min(sizeof(Datum), sss->buflen2),
+																	 &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
 								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
+										PG_ICU_LIB(sss->locale)->errorName(status))));
 				}
 				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+					bsize = PG_ICU_LIB(sss->locale)->getSortKey(PG_ICU_COL(sss->locale),
+																uchar, ulen,
+																(uint8_t *) sss->buf2, sss->buflen2);
 			}
 			else
 #endif
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 05ab087934..e60081c384 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3922,6 +3922,20 @@ struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"icu_library_path", PGC_SUSET, CLIENT_CONN_OTHER,
+			gettext_noop("Sets the path for dynamically loadable ICU libraries."),
+			gettext_noop("If versions of ICU other than the one that "
+						 "PostgreSQL is linked against are needed, they will be "
+						 "opened from this path.  If empty, the system linker search "
+						 "path will be used."),
+			GUC_SUPERUSER_ONLY
+		},
+		&icu_library_path,
+		"",
+		NULL, NULL, NULL
+	},
+
 	{
 		{"krb_server_keyfile", PGC_SIGHUP, CONN_AUTH_AUTH,
 			gettext_noop("Sets the location of the Kerberos server key file."),
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index a875942123..59613e4f56 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -17,6 +17,7 @@
 #endif
 #ifdef USE_ICU
 #include <unicode/ucol.h>
+#include <unicode/ubrk.h>
 #endif
 
 #ifdef USE_ICU
@@ -40,6 +41,7 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT char *icu_library_path;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -63,6 +65,71 @@ extern struct lconv *PGLC_localeconv(void);
 
 extern void cache_locale_time(void);
 
+#ifdef USE_ICU
+
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	void	   *libicui18n_handle;
+	void	   *libicuuc_handle;
+	UCollator  *(*open) (const char *loc, UErrorCode *status);
+	void		(*close) (UCollator *coll);
+	void		(*getVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+				UCollationResult(*strcoll) (const UCollator *coll,
+											const UChar *source,
+											int32_t sourceLength,
+											const UChar *target,
+											int32_t targetLength);
+				UCollationResult(*strcollUTF8) (const UCollator *coll,
+												const char *source,
+												int32_t sourceLength,
+												const char *target,
+												int32_t targetLength,
+												UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	struct pg_icu_library *next;
+} pg_icu_library;
+
+#endif
 
 /*
  * We define our own wrapper around locale_t so we can keep the same
@@ -84,12 +151,18 @@ struct pg_locale_struct
 		{
 			const char *locale;
 			UCollator  *ucol;
+			pg_icu_library *lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
 	}			info;
 };
 
+#ifdef USE_ICU
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+#define PG_ICU_COL(x) ((x)->info.icu.ucol)
+#endif
+
 typedef struct pg_locale_struct *pg_locale_t;
 
 extern PGDLLIMPORT struct pg_locale_struct default_locale;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d9b839c979..0ccbbb711e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1101,6 +1101,7 @@ HeapTupleTableSlot
 HistControl
 HotStandbyState
 I32
+ICU_Convert_BI_Func
 ICU_Convert_Func
 ID
 INFIX
@@ -2854,6 +2855,7 @@ TypeName
 U
 U32
 U8
+UBreakIterator
 UChar
 UCharIterator
 UColAttribute
@@ -3482,6 +3484,7 @@ pg_funcptr_t
 pg_gssinfo
 pg_hmac_ctx
 pg_hmac_errno
+pg_icu_library
 pg_int64
 pg_local_to_utf_combined
 pg_locale_t
-- 
2.30.2

#83

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Thomas Munro (#82)

1 attachment(s)

Re: Collation version tracking for macOS

On Sat, Oct 22, 2022 at 10:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:

... But it
doesn't provide a way for me to create a new database that uses 63 on
purpose when I know what I'm doing. There are various reasons I might
want to do that.

Thinking some more about this, I guess that could be addressed by
having an explicit way to request either the library version or
collversion-style version when creating a database or collation, but
not actually storing it in daticulocale/colliculocale. That could be
done either as part of the string that is trimmed off before storing
it (so it's only used briefly during creation to find a non-default
library)... Perhaps that'd look like initdb --icu-locale "67:en" (ICU
library version) or "154.14:en" (individual collation version) or some
new syntax in a few places. Thereafter, it would always be looked up
by searching for the right library by [dat]collversion as Peter E
suggested.
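
To make the proposed syntax concrete, here is a minimal sketch in C of how such a creation-time prefix could be told apart from a plain locale name. It is not code from the attached patch: the enum, the function name classify_locale_prefix, and the rule that a dotted prefix means an individual collation version are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of the text before the first ':' */
typedef enum
{
    PREFIX_NONE,             /* "en": no version requested */
    PREFIX_LIBRARY_VERSION,  /* "67:en": ICU library major version */
    PREFIX_COLLVERSION,      /* "154.14:en": individual collation version */
    PREFIX_INVALID           /* anything else before the ':' */
} prefix_kind;

/*
 * Classify the prefix of spec and point *rest at the locale name that
 * would actually be stored in the catalog.  A prefix made only of digits
 * is taken to be a library major version; digits separated by single dots
 * are taken to be a collation version (an assumption of this sketch).
 */
prefix_kind
classify_locale_prefix(const char *spec, const char **rest)
{
    const char *sep;
    int         ndigits = 0;
    int         ndots = 0;

    assert(spec != NULL && rest != NULL);

    sep = strchr(spec, ':');
    if (sep == NULL)
    {
        *rest = spec;
        return PREFIX_NONE;
    }
    *rest = sep + 1;

    for (const char *p = spec; p < sep; p++)
    {
        if (isdigit((unsigned char) *p))
            ndigits++;
        else if (*p == '.' && p > spec && p + 1 < sep && p[-1] != '.')
            ndots++;            /* a dot must sit between two digits */
        else
            return PREFIX_INVALID;
    }

    if (ndigits == 0)
        return PREFIX_INVALID;
    return ndots == 0 ? PREFIX_LIBRARY_VERSION : PREFIX_COLLVERSION;
}
```

With this, "67:en" would be classified as a library version, "154.14:en" as a collation version and "en" as unprefixed; in every case rest ends up pointing at "en", the only part that would be stored.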

Let me try harder to vocalise some more thoughts that have stopped me
from trying to code the search-by-collversion design so far:

Suppose your pgdata encounters a PostgreSQL linked against a later ICU
library, most likely after an OS upgrade or migration, a pg_upgrade,
or via streaming replication. You might get a new error "can't find
ICU collation 'en' with version '153.14'; HINT: install missing ICU
library version", and somehow you'll have to work out which one might
contain 'en' v153.14 and install it with apt-get etc. Then it'll
magically work: your postgres linked against (say) 71 will happily
work with the dlopen'd 67. This is enough if you want to stay on 67
until the heat death of the universe. So far so good.
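
The lookup described above can be sketched as a walk over the available libraries, asking each one which version it reports for the collation. This is only an illustration under stated assumptions: fake_icu_library, find_library_by_collversion and the two demo callbacks are hypothetical names, and the version strings the callbacks return are placeholders rather than values reported by any real ICU release.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one installed ICU library. */
typedef struct fake_icu_library
{
    int         major_version;
    /* Reports the collation version for a locale, or NULL if unknown. */
    const char *(*collation_version) (const char *locale);
    struct fake_icu_library *next;
} fake_icu_library;

/* Placeholder callbacks; the version strings are made up for the demo. */
const char *
demo_older_version(const char *locale)
{
    return strcmp(locale, "en") == 0 ? "153.14" : NULL;
}

const char *
demo_newer_version(const char *locale)
{
    return strcmp(locale, "en") == 0 ? "153.112" : NULL;
}

/*
 * Return the first library in the list whose reported version for the
 * given locale equals the recorded collversion, or NULL if none does.
 */
fake_icu_library *
find_library_by_collversion(fake_icu_library *list,
                            const char *locale,
                            const char *collversion)
{
    assert(locale != NULL && collversion != NULL);

    for (fake_icu_library *lib = list; lib != NULL; lib = lib->next)
    {
        const char *version = lib->collation_version(locale);

        if (version != NULL && strcmp(version, collversion) == 0)
            return lib;
    }
    return NULL;
}
```

A NULL result corresponds to the "install missing ICU library version" case: no library that can be found reports the recorded version for that collation.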

Problem 1: Suppose you're ready to start using (say) v72. I guess
you'd use the REFRESH command, which would open the main linked ICU's
collversion and stamp that into the catalogue, at which point new
sessions would start using that, and then you'd have to rebuild all
your indexes (with no help from PG to tell you how to find everything
that needs to be rebuilt, as belaboured in previous reverted work).
Aside from the possibility of getting the rebuilding job wrong (as
belaboured elsewhere), it's not great, because there is still a
transitional period where you can be using the wrong version for your
data. So this requires some careful planning and understanding from
the administrator.

I admit that the upgrade story is a tiny bit better than the v5
DB2-style patch, which starts using the new version immediately if you
didn't use a prefix (and logs the usual warnings about collversion
mismatch) instead of waiting for you to run REFRESH. But both of them
have a phase where they might use the wrong library to access an
index. That's dissatisfying, and leads me to prefer the simple
DB2-style solution that at least admits up front that it's not very
clever. The DB2-style patch could be improved a bit here with the
addition of one more GUC: default_icu_library_version, so the administrator,
rather than the packager, remains in control of which version we use
for non-prefixed iculocale values (likely to be what almost everyone
is interested in), defaulting to what the packager linked against.
I've added that to the patch for illustration (though obviously the
error messages produced by collversion mismatch could use some
adjustment, i.e. to clarify that the warning might be cleared by
installing and selecting a different library version).
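
The resulting precedence for choosing a library can be summarised in a few lines of C. This is a simplified sketch rather than the patch's own code: the function name choose_icu_major_version and its parameters are hypothetical, with guc_default standing in for the default_icu_library_version setting from the attached patch and 0 meaning "not specified".

```c
#include <assert.h>

/*
 * Pick the ICU major version to use for a locale string.
 *
 * requested_version: version taken from an "NN:" prefix, or 0 if none
 * guc_default:       administrator's default version, or 0 if unset
 * linked_version:    version the server was compiled and linked against
 * min_version:       oldest version considered to have a usable API
 *
 * Returns the chosen version, or -1 if it falls outside the supported
 * range.  Versions newer than the linked one are refused.
 */
int
choose_icu_major_version(int requested_version, int guc_default,
                         int linked_version, int min_version)
{
    int         chosen;

    assert(linked_version >= min_version);

    if (requested_version > 0)
        chosen = requested_version;     /* explicit prefix wins */
    else if (guc_default > 0)
        chosen = guc_default;           /* then the administrator's default */
    else
        chosen = linked_version;        /* finally what the packager linked */

    if (chosen < min_version || chosen > linked_version)
        return -1;
    return chosen;
}
```

For a server linked against 71 with a minimum of 54, an unprefixed locale resolves to 71 unless the default is set (to 67, say), while a "63:" prefix selects 63 regardless of the default.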

Problem 2: If ICU 67 ever decides to report a different version for a
given collation (would it ever do that? I don't expect so, but ...),
we'd be unable to open the collation with the search-by-collversion
design, and potentially the database. What is a user supposed to do
then? Presumably our error/hint for that would be "please insert the
correct ICU library into drive A", but now there is no correct
library; if you can even diagnose what's happened, I guess you might
downgrade the ICU library using package tools or whatever if possible,
but otherwise you'd be stuck, if you just can't get the right library.
Is this a problem? Would you want to be able to say "I don't care,
computer, please just press on"? So I think we need a way to turn off
the search-by-collversion thing. How should it look?

I'd love to hear others' thoughts on how we can turn this into a
workable solution. Hopefully while staying simple...

Attachments:

v6-0001-WIP-Multi-version-ICU.patch (text/x-patch; charset=US-ASCII)

From 0355984c9a80ff15bfac51677fea30b9be68226b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v6] WIP: Multi-version ICU.

Add a layer of indirection when accessing ICU, so that multiple major
versions of the library can be used at once.  Versions other than the
one that PostgreSQL was linked against are opened with dlopen(), but we
refuse to open versions higher than the one we were compiled against.
The ABI might change in future releases so that wouldn't be safe.

By default, the system linker's default search path is used to find
libraries, but icu_library_path may be used to specify an absolute path
to look in.  ICU libraries are expected to have been built without ICU's
--disable-renaming option.  That is, major versions must use distinct
symbol names.
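
As a rough illustration of the naming scheme this relies on, the following C sketch builds a versioned symbol name and a versioned library file name. The helper names are hypothetical, and the ".so" suffix is an assumption matching Linux-style naming; other platforms use different suffixes and layouts.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * With symbol renaming enabled, ICU exports each function with the major
 * version appended, e.g. "ucol_open" in ICU 67 becomes "ucol_open_67".
 */
int
versioned_symbol_name(char *buf, size_t size,
                      const char *function, int major_version)
{
    assert(buf != NULL && function != NULL);
    return snprintf(buf, size, "%s_%d", function, major_version);
}

/*
 * Build "<dir>/lib<stem>.so.<major>", or just "lib<stem>.so.<major>" when
 * dir is empty so that the system linker's search path is used.
 */
int
versioned_library_name(char *buf, size_t size, const char *dir,
                       const char *stem, int major_version)
{
    assert(buf != NULL && dir != NULL && stem != NULL);
    return snprintf(buf, size, "%s%slib%s.so.%d",
                    dir, strlen(dir) > 0 ? "/" : "", stem, major_version);
}
```

The resulting strings are what would be handed to dlopen() and dlsym() respectively; a library built with --disable-renaming would export plain "ucol_open" and so would not be found this way.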

This arrangement means that at least one major version of ICU is always
available -- the one that PostgreSQL was linked against.  It should be
simple on most software distributions to install extra versions using a
package manager, or to build extra libraries as required, to access
older ICU releases.  For example, on Debian bullseye the packages are
named libicu63, libicu67, libicu71.

In this version of the patch, '63:en' used as a database default locale
or COLLATION object requests ICU library 63, and 'en' requests the
library version selected by the GUC default_icu_library_version,
defaulting to the version that the executable is linked against.

XXX Many other designs possible, to discuss!

Discussion: https://postgr.es/m/CA%2BhUKGL4VZRpP3CkjYQkv4RQ6pRYkPkSNgKSxFBwciECQ0mEuQ%40mail.gmail.com
---
 src/backend/access/hash/hashfunc.c            |  16 +-
 src/backend/commands/collationcmds.c          |  20 +
 src/backend/utils/adt/formatting.c            |  53 ++-
 src/backend/utils/adt/pg_locale.c             | 376 +++++++++++++++++-
 src/backend/utils/adt/varchar.c               |  16 +-
 src/backend/utils/adt/varlena.c               |  56 +--
 src/backend/utils/misc/guc_tables.c           |  28 ++
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/utils/pg_locale.h                 |  74 ++++
 src/tools/pgindent/typedefs.list              |   3 +
 10 files changed, 581 insertions(+), 66 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..0a61538efd 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,11 @@ hashtext(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -355,11 +355,11 @@ hashtextextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index fcfc02d2ae..26e747d9d7 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -812,6 +812,26 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 					CreateComments(collid, CollationRelationId, 0,
 								   icucomment);
 			}
+
+			/* Also create an object pinned to an ICU major version. */
+			collid = CollationCreate(psprintf("%s-x-icu-%d", langtag, U_ICU_VERSION_MAJOR_NUM),
+									 nspid, GetUserId(),
+									 COLLPROVIDER_ICU, true, -1,
+									 NULL, NULL,
+									 psprintf("%d:%s", U_ICU_VERSION_MAJOR_NUM, iculocstr),
+									 get_collation_actual_version(COLLPROVIDER_ICU, iculocstr),
+									 true, true);
+			if (OidIsValid(collid))
+			{
+				ncreated++;
+
+				CommandCounterIncrement();
+
+				icucomment = get_icu_locale_comment(name);
+				if (icucomment)
+					CreateComments(collid, CollationRelationId, 0,
+								   icucomment);
+			}
 		}
 	}
 #endif							/* USE_ICU */
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 26f498b5df..0c3c7724d7 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1599,6 +1599,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1623,18 +1628,41 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						PG_ICU_LIB(mylocale)->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->info.icu.locale, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->info.icu.locale, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						PG_ICU_LIB(mylocale)->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1702,7 +1730,8 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToLower,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1824,7 +1853,8 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToUpper,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1947,8 +1977,9 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
+			len_conv = icu_convert_case_bi(PG_ICU_LIB(mylocale)->strToTitle,
+										   mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..666a79b907 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -58,6 +58,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_control.h"
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/formatting.h"
 #include "utils/guc_hooks.h"
@@ -69,6 +70,7 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -79,14 +81,32 @@
 #include <shlwapi.h>
 #endif
 
+#include <dlfcn.h>
+
 #define		MAX_L10N_DATA		80
 
+#ifdef USE_ICU
+
+/*
+ * We don't want to call into dlopen'd ICU libraries that are newer than the
+ * one we were compiled and linked against, just in case there is an
+ * incompatible API change.
+ */
+#define PG_MAX_ICU_MAJOR_VERSION U_ICU_VERSION_MAJOR_NUM
+
+/* An old ICU release that we know has the right API. */
+#define PG_MIN_ICU_MAJOR_VERSION 54
+
+#endif
+
 
 /* GUC settings */
 char	   *locale_messages;
 char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
+char	   *icu_library_path;
+int			default_icu_library_version;
 
 /*
  * lc_time localization cache.
@@ -1398,29 +1418,354 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
+#ifdef USE_ICU
+
 struct pg_locale_struct default_locale;
 
+/* Linked list of ICU libraries we have loaded. */
+static pg_icu_library *icu_library_list = NULL;
+
+/*
+ * Free an ICU library.  pg_icu_library objects that are successfully
+ * constructed stick around for the lifetime of the backend, but this is used
+ * to clean up if initialization fails.
+ */
+static void
+free_icu_library(pg_icu_library *lib)
+{
+	if (lib->libicui18n_handle)
+		dlclose(lib->libicui18n_handle);
+	if (lib->libicuuc_handle)
+		dlclose(lib->libicuuc_handle);
+	pfree(lib);
+}
+
+static void *
+get_icu_function(void *handle, const char *function, int version)
+{
+	char		name[80];
+
+	snprintf(name, sizeof(name), "%s_%d", function, version);
+
+	return dlsym(handle, name);
+}
+
+/*
+ * Probe a dynamically loaded library to see which major version of ICU it
+ * contains.
+ */
+static int
+get_icu_library_major_version(void *handle)
+{
+	for (int i = PG_MIN_ICU_MAJOR_VERSION; i <= PG_MAX_ICU_MAJOR_VERSION; ++i)
+		if (get_icu_function(handle, "ucol_open", i) ||
+			get_icu_function(handle, "u_strToUpper", i))
+			return i;
+
+	/*
+	 * It's a later version we don't dare use, an old version we don't
+	 * support, an ICU build with symbol suffixes disabled, or not ICU.
+	 */
+	return -1;
+}
+
+/*
+ * We have to load a couple of different libraries, so we'll reuse the code to
+ * do that.
+ */
+static void *
+load_icu_library(pg_icu_library *lib, const char *name)
+{
+	void	   *handle;
+	int			found_major_version;
+
+	handle = dlopen(name, RTLD_NOW | RTLD_GLOBAL);
+	if (handle == NULL)
+	{
+		int			errno_save = errno;
+
+		free_icu_library(lib);
+		errno = errno_save;
+
+		ereport(ERROR,
+				(errmsg("could not load library \"%s\": %m", name)));
+	}
+
+	found_major_version = get_icu_library_major_version(handle);
+	if (found_major_version < 0)
+	{
+		free_icu_library(lib);
+		ereport(ERROR,
+				(errmsg("could not find compatible ICU major version in library \"%s\"",
+						name)));
+	}
+
+	if (found_major_version != lib->major_version)
+	{
+		free_icu_library(lib);
+		ereport(ERROR,
+				(errmsg("expected to find ICU major version %d in library \"%s\", but found %d",
+						lib->major_version, name, found_major_version)));
+	}
+
+	return handle;
+}
+
+/*
+ * Given an ICU major version number, return the object we need to access it,
+ * or fail while trying to load it.
+ */
+static pg_icu_library *
+get_icu_library(int major_version)
+{
+	pg_icu_library *lib;
+
+	/* XXX Move range check into guc_tables.c? */
+	if (major_version < PG_MIN_ICU_MAJOR_VERSION ||
+		major_version > PG_MAX_ICU_MAJOR_VERSION)
+		elog(ERROR,
+			"ICU version must be between %d and %d",
+			 PG_MIN_ICU_MAJOR_VERSION,
+			 PG_MAX_ICU_MAJOR_VERSION);
+
+	/* Try to find it in our list of existing libraries. */
+	for (lib = icu_library_list; lib; lib = lib->next)
+		if (lib->major_version == major_version)
+			return lib;
+
+	/* Make a new entry. */
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(*lib));
+	if (major_version == U_ICU_VERSION_MAJOR_NUM)
+	{
+		/*
+		 * This is the version we were compiled and linked against.  Simply
+		 * assign the function pointers.
+		 *
+		 * These assignments will fail to compile if an incompatible API
+		 * change is made to some future version of ICU, at which point we
+		 * might need to consider special treatment for different major
+		 * version ranges, with intermediate trampoline functions.
+		 */
+		lib->major_version = major_version;
+		lib->open = ucol_open;
+		lib->close = ucol_close;
+		lib->getVersion = ucol_getVersion;
+		lib->versionToString = u_versionToString;
+		lib->strcoll = ucol_strcoll;
+		lib->strcollUTF8 = ucol_strcollUTF8;
+		lib->getSortKey = ucol_getSortKey;
+		lib->nextSortKeyPart = ucol_nextSortKeyPart;
+		lib->setUTF8 = uiter_setUTF8;
+		lib->errorName = u_errorName;
+		lib->strToUpper = u_strToUpper;
+		lib->strToLower = u_strToLower;
+		lib->strToTitle = u_strToTitle;
+
+		/*
+		 * Also assert the size of a couple of types used as output buffers,
+		 * as a canary to tell us to add extra padding in the (unlikely) event
+		 * that a later release makes these values smaller.
+		 */
+		StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+						 "u_versionToString output buffer size changed incompatibly");
+		StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+						 "ucol_getVersion output buffer size changed incompatibly");
+	}
+	else
+	{
+		/* This is an older version, so we'll need to use dlopen(). */
+		char		libicui18n_name[MAXPGPATH];
+		char		libicuuc_name[MAXPGPATH];
+
+		/*
+		 * We don't like to open versions newer than what we're linked
+		 * against, to reduce the risk of an API change biting us.
+		 */
+		if (major_version > U_ICU_VERSION_MAJOR_NUM)
+			elog(ERROR, "ICU major version %d higher than linked version %d, refusing to open",
+				 major_version, U_ICU_VERSION_MAJOR_NUM);
+
+		lib->major_version = major_version;
+
+		/*
+		 * See
+		 * https://unicode-org.github.io/icu/userguide/icu4c/packaging.html#icu-versions
+		 * for conventions on library naming on POSIX and Windows systems.
+		 */
+
+		/* Load the collation library. */
+		snprintf(libicui18n_name,
+				 sizeof(libicui18n_name),
+#ifdef WIN32
+				 "%s%sicui18n%d." DLSUFFIX,
+				 icu_library_path,
+				 icu_library_path[0] ? "\\" : "",
+#else
+				 "%s%slibicui18n" DLSUFFIX ".%d",
+				 icu_library_path,
+				 icu_library_path[0] ? "/" : "",
+#endif
+				 major_version);
+		lib->libicui18n_handle = load_icu_library(lib, libicui18n_name);
+
+		/* Load the ctype library. */
+		snprintf(libicuuc_name,
+				 sizeof(libicuuc_name),
+#ifdef WIN32
+				 "%s%sicuuc%d." DLSUFFIX,
+				 icu_library_path,
+				 icu_library_path[0] ? "\\" : "",
+#else
+				 "%s%slibicuuc" DLSUFFIX ".%d",
+				 icu_library_path,
+				 icu_library_path[0] ? "/" : "",
+#endif
+				 major_version);
+		lib->libicuuc_handle = load_icu_library(lib, libicuuc_name);
+
+		/* Look up all the functions we need. */
+		lib->open = get_icu_function(lib->libicui18n_handle,
+									 "ucol_open",
+									 major_version);
+		lib->close = get_icu_function(lib->libicui18n_handle,
+									  "ucol_close",
+									  major_version);
+		lib->getVersion = get_icu_function(lib->libicui18n_handle,
+										   "ucol_getVersion",
+										   major_version);
+		lib->versionToString = get_icu_function(lib->libicui18n_handle,
+												"u_versionToString",
+												major_version);
+		lib->strcoll = get_icu_function(lib->libicui18n_handle,
+										"ucol_strcoll",
+										major_version);
+		lib->strcollUTF8 = get_icu_function(lib->libicui18n_handle,
+											"ucol_strcollUTF8",
+											major_version);
+		lib->getSortKey = get_icu_function(lib->libicui18n_handle,
+										   "ucol_getSortKey",
+										   major_version);
+		lib->nextSortKeyPart = get_icu_function(lib->libicui18n_handle,
+												"ucol_nextSortKeyPart",
+												major_version);
+		lib->setUTF8 = get_icu_function(lib->libicui18n_handle,
+										"uiter_setUTF8",
+										major_version);
+		lib->errorName = get_icu_function(lib->libicui18n_handle,
+										  "u_errorName",
+										  major_version);
+		lib->strToUpper = get_icu_function(lib->libicuuc_handle,
+										   "u_strToUpper",
+										   major_version);
+		lib->strToLower = get_icu_function(lib->libicuuc_handle,
+										   "u_strToLower",
+										   major_version);
+		lib->strToTitle = get_icu_function(lib->libicuuc_handle,
+										   "u_strToTitle",
+										   major_version);
+		if (!lib->open ||
+			!lib->close ||
+			!lib->getVersion ||
+			!lib->versionToString ||
+			!lib->strcoll ||
+			!lib->strcollUTF8 ||
+			!lib->getSortKey ||
+			!lib->nextSortKeyPart ||
+			!lib->setUTF8 ||
+			!lib->errorName ||
+			!lib->strToUpper ||
+			!lib->strToLower ||
+			!lib->strToTitle)
+		{
+			free_icu_library(lib);
+			ereport(ERROR,
+					(errmsg("could not find expected symbols in library \"%s\"",
+							libicui18n_name)));
+		}
+	}
+
+	lib->next = icu_library_list;
+	icu_library_list = lib;
+
+	return lib;
+}
+
+/*
+ * Look up the library to use for a given collcollate string.
+ */
+static pg_icu_library *
+get_icu_library_for_collation(const char *collcollate, const char **rest)
+{
+	int			major_version;
+	char	   *separator;
+	char	   *after_prefix;
+
+	separator = strchr(collcollate, ':');
+
+	/*
+	 * If it's a traditional value without a prefix, use the default ICU
+	 * library.  That's the one we were linked against, or another one if
+	 * default_icu_library_version has been set.
+	 */
+	if (separator == NULL)
+	{
+		*rest = collcollate;
+
+		if (default_icu_library_version > 0)
+			major_version = default_icu_library_version;
+		else
+			major_version = U_ICU_VERSION_MAJOR_NUM;
+		return get_icu_library(major_version);
+	}
+
+	/* If it has a prefix, interpret it as an ICU major version. */
+	major_version = strtol(collcollate, &after_prefix, 10);
+	if (after_prefix != separator)
+		elog(ERROR,
+			 "could not parse ICU major library version: \"%s\"",
+			 collcollate);
+	if (major_version < PG_MIN_ICU_MAJOR_VERSION ||
+		major_version > PG_MAX_ICU_MAJOR_VERSION)
+		elog(ERROR,
+			 "ICU major library version out of supported range: \"%s\"",
+			 collcollate);
+
+	/* The part after the separator will be passed to the library. */
+	*rest = separator + 1;
+
+	return get_icu_library(major_version);
+}
+
+#endif
+
 void
 make_icu_collator(const char *iculocstr,
 				  struct pg_locale_struct *resultp)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_icu_library_for_collation(iculocstr, &iculocstr);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
+	collator = lib->open(iculocstr, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+						iculocstr, lib->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	/*
+	 * XXX can we just drop this cruft and make 54 the minimum supported
+	 * version?
+	 */
+	if (lib->major_version < 54)
 		icu_set_collation_attributes(collator, iculocstr);
 
 	/* We will leak this string if the caller errors later :-( */
 	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
 	resultp->info.icu.ucol = collator;
+	resultp->info.icu.lib = lib;
 #else							/* not USE_ICU */
 	/* could get here if a collation was created by a build with ICU */
 	ereport(ERROR,
@@ -1651,21 +1996,23 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 #ifdef USE_ICU
 	if (collprovider == COLLPROVIDER_ICU)
 	{
+		pg_icu_library *lib;
 		UCollator  *collator;
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
+		lib = get_icu_library_for_collation(collcollate, &collcollate);
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = lib->open(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, lib->errorName(status))));
+		lib->getVersion(collator, versioninfo);
+		lib->close(collator);
 
-		u_versionToString(versioninfo, buf);
+		lib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -1733,6 +2080,8 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 
 
 #ifdef USE_ICU
+
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -1954,19 +2303,22 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_icu_library_for_collation(icu_locale, &icu_locale);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = lib->open(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, lib->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	/* XXX can we just drop this cruft? */
+	if (lib->major_version < 54)
 		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+	lib->close(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 68e2e6f7a7..e0c86870e0 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1026,11 +1026,11 @@ hashbpchar(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, keydata, keylen);
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -1087,11 +1087,11 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c5e7ee7ca2..cf891a5654 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1667,13 +1667,14 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					UErrorCode	status;
 
 					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
+					result = PG_ICU_LIB(mylocale)->strcollUTF8(PG_ICU_COL(mylocale),
+															   arg1, len1,
+															   arg2, len2,
+															   &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
+								(errmsg("collation failed: %s",
+										PG_ICU_LIB(mylocale)->errorName(status))));
 				}
 				else
 #endif
@@ -1686,9 +1687,9 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
 					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
 
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
+					result = PG_ICU_LIB(mylocale)->strcoll(PG_ICU_COL(mylocale),
+														   uchar1, ulen1,
+														   uchar2, ulen2);
 
 					pfree(uchar1);
 					pfree(uchar2);
@@ -2388,13 +2389,14 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				UErrorCode	status;
 
 				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
+				result = PG_ICU_LIB(sss->locale)->strcollUTF8(PG_ICU_COL(sss->locale),
+															  a1p, len1,
+															  a2p, len2,
+															  &status);
 				if (U_FAILURE(status))
 					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
+							(errmsg("collation failed: %s",
+									PG_ICU_LIB(sss->locale)->errorName(status))));
 			}
 			else
 #endif
@@ -2407,9 +2409,9 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
 				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
 
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
+				result = PG_ICU_LIB(sss->locale)->strcoll(PG_ICU_COL(sss->locale),
+														  uchar1, ulen1,
+														  uchar2, ulen2);
 
 				pfree(uchar1);
 				pfree(uchar2);
@@ -2569,24 +2571,24 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 					uint32_t	state[2];
 					UErrorCode	status;
 
-					uiter_setUTF8(&iter, sss->buf1, len);
+					PG_ICU_LIB(sss->locale)->setUTF8(&iter, sss->buf1, len);
 					state[0] = state[1] = 0;	/* won't need that again */
 					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
+					bsize = PG_ICU_LIB(sss->locale)->nextSortKeyPart(PG_ICU_COL(sss->locale),
+																	 &iter,
+																	 state,
+																	 (uint8_t *) sss->buf2,
+																	 Min(sizeof(Datum), sss->buflen2),
+																	 &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
 								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
+										PG_ICU_LIB(sss->locale)->errorName(status))));
 				}
 				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+					bsize = PG_ICU_LIB(sss->locale)->getSortKey(PG_ICU_COL(sss->locale),
+																uchar, ulen,
+																(uint8_t *) sss->buf2, sss->buflen2);
 			}
 			else
 #endif
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 05ab087934..9489268b39 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2939,6 +2939,20 @@ struct config_int ConfigureNamesInt[] =
 		check_max_worker_processes, NULL, NULL
 	},
 
+	{
+		{"default_icu_library_version",
+			PGC_SIGHUP,
+			COMPAT_OPTIONS_PREVIOUS,
+			gettext_noop("Default major version of ICU library to use for collations if not specified."),
+			NULL
+		},
+		&default_icu_library_version,
+		0,
+		0,
+		1000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"max_logical_replication_workers",
 			PGC_POSTMASTER,
@@ -3922,6 +3936,20 @@ struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"icu_library_path", PGC_SUSET, COMPAT_OPTIONS_PREVIOUS,
+			gettext_noop("Sets the path for dynamically loadable ICU libraries."),
+			gettext_noop("If versions of ICU other than the one that "
+						 "PostgreSQL is linked against are needed, they will "
+						 "be opened from this directory.  If empty, the "
+						 "system linker search path will be used."),
+			GUC_SUPERUSER_ONLY
+		},
+		&icu_library_path,
+		"",
+		NULL, NULL, NULL
+	},
+
 	{
 		{"krb_server_keyfile", PGC_SIGHUP, CONN_AUTH_AUTH,
 			gettext_noop("Sets the location of the Kerberos server key file."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c351..2713c92124 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -727,6 +727,11 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#default_icu_library_version = 0	# default major version of ICU library
+					# (0 for the linked version)
+#icu_library_path = ''			# path for dynamically loaded ICU
+					# libraries
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index a875942123..d26e5738f9 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -17,6 +17,7 @@
 #endif
 #ifdef USE_ICU
 #include <unicode/ucol.h>
+#include <unicode/ubrk.h>
 #endif
 
 #ifdef USE_ICU
@@ -40,6 +41,8 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT char *icu_library_path;
+extern PGDLLIMPORT int default_icu_library_version;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -63,6 +66,71 @@ extern struct lconv *PGLC_localeconv(void);
 
 extern void cache_locale_time(void);
 
+#ifdef USE_ICU
+
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	void	   *libicui18n_handle;
+	void	   *libicuuc_handle;
+	UCollator  *(*open) (const char *loc, UErrorCode *status);
+	void		(*close) (UCollator *coll);
+	void		(*getVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+				UCollationResult(*strcoll) (const UCollator *coll,
+											const UChar *source,
+											int32_t sourceLength,
+											const UChar *target,
+											int32_t targetLength);
+				UCollationResult(*strcollUTF8) (const UCollator *coll,
+												const char *source,
+												int32_t sourceLength,
+												const char *target,
+												int32_t targetLength,
+												UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	struct pg_icu_library *next;
+} pg_icu_library;
+
+#endif
 
 /*
  * We define our own wrapper around locale_t so we can keep the same
@@ -84,12 +152,18 @@ struct pg_locale_struct
 		{
 			const char *locale;
 			UCollator  *ucol;
+			pg_icu_library *lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
 	}			info;
 };
 
+#ifdef USE_ICU
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+#define PG_ICU_COL(x) ((x)->info.icu.ucol)
+#endif
+
 typedef struct pg_locale_struct *pg_locale_t;
 
 extern PGDLLIMPORT struct pg_locale_struct default_locale;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d9b839c979..0ccbbb711e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1101,6 +1101,7 @@ HeapTupleTableSlot
 HistControl
 HotStandbyState
 I32
+ICU_Convert_BI_Func
 ICU_Convert_Func
 ID
 INFIX
@@ -2854,6 +2855,7 @@ TypeName
 U
 U32
 U8
+UBreakIterator
 UChar
 UCharIterator
 UColAttribute
@@ -3482,6 +3484,7 @@ pg_funcptr_t
 pg_gssinfo
 pg_hmac_ctx
 pg_hmac_errno
+pg_icu_library
 pg_int64
 pg_local_to_utf_combined
 pg_locale_t
-- 
2.30.2

#84

Peter Eisentraut

peter.eisentraut@enterprisedb.com

about 3 years ago

In reply to: Thomas Munro (#83)

Re: Collation version tracking for macOS

On 22.10.22 03:22, Thomas Munro wrote:

Suppose your pgdata encounters a PostgreSQL linked against a later ICU
library, most likely after an OS upgrade or migration, a pg_upgrade,
or via streaming replication. You might get a new error "can't find
ICU collation 'en' with version '153.14'; HINT: install missing ICU
library version", and somehow you'll have to work out which one might
contain 'en' v153.14 and install it with apt-get etc. Then it'll
magically work: your postgres linked against (say) 71 will happily
work with the dlopen'd 67. This is enough if you want to stay on 67
until the heat death of the universe. So far so good.

What I'm wondering is where those ICU installations are going to come
from. In order for this project to be viable, we would need to convince
some combination of ICU maintainers, OS packagers, and PGDG packagers to
provide and maintain five years' worth of ICU packages (yearly releases
AFAICT). Is that something we are willing to get into?

(Even to test this I need to figure out where to get another ICU
installation from. I'll try how easy manual installations are.)

#85

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Peter Eisentraut (#84)

Re: Collation version tracking for macOS

On Tue, Nov 1, 2022 at 11:33 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

What I'm wondering is where those ICU installations are going to come
from. In order for this project to be viable, we would need to convince
some combination of ICU maintainers, OS packagers, and PGDG packagers to
provide and maintain five years' worth of ICU packages (yearly releases
AFAICT). Is that something we are willing to get into?

I hacked on this on a Debian machine that has a couple of these
installed and they work fine, but now I realise that might have to do
with the major upgrade history of the machine. So yeah... probably.
:-/ Not being involved in packaging I have no idea how plausible such
a backports (erm, forwardports?) repo would be, and I have even less
idea for other distros.

#86

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Thomas Munro (#85)

Re: Collation version tracking for macOS

On Wed, Nov 2, 2022 at 1:42 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Nov 1, 2022 at 11:33 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

What I'm wondering is where those ICU installations are going to come
from. In order for this project to be viable, we would need to convince
some combination of ICU maintainers, OS packagers, and PGDG packagers to
provide and maintain five years' worth of ICU packages (yearly releases
AFAICT). Is that something we are willing to get into?

I hacked on this on a Debian machine that has a couple of these
installed and they work fine, but now I realise that might have to do
with the major upgrade history of the machine. So yeah... probably.
:-/ Not being involved in packaging I have no idea how plausible such
a backports (erm, forwardports?) repo would be, and I have even less
idea for other distros.

After sleeping on it, I don't really agree that the project is not
viable even if it requires hoop-jumping to set up right now. It's a
chicken-and-egg problem, and the first step is to make it possible to
do it at all, thereby creating the demand for convenient packages. I
think we have several topics here:

1. Technical problems relating to dlopen'ing. Does it work? Is the
default dlopen() secure enough? Is it building sensible library
names, even on the freaky-library OSes (Windows, macOS, AIX)? Is it
enough to have that GUC for non-default path, should it be a search
path, should it share the existing dynamic_library_path? Are the
indirect function calls fast enough? Is the way it handles API
stability sound? Can we drop some unfinished complexity by dropping
pre-53 ICU? Does it use too much memory?
2. User experience problems relating to upgrade paths and user
interface. Is it enough to start with the basic DB2-style approach
that I've prototyped here? How should we refer to library versions?
Is your search-for-the-collversion idea better? My gut feeling is
that the early version should be about giving people options, and not
trying to be too clever/automatic with questionable semantics, and
later improvements could follow, for example if we have another go at
the per-object version tracking.
3. Library availability. This is a problem for downstream
communities to solve. For example, the people who build Windows
installers might want to start bundling the ICU versions from their
earlier releases, the people involved with each Linux/BSD distro would
hopefully figure out a good way to publish the packages from older OS
releases in one repo, and the people running managed systems probably
do their own packaging anyway, they'll figure it out. I realise that
you are involved in packaging and I am not, so we probably have
different perspectives: I get to say "and here, magic happens!" :-)

FWIW at least 57, 63 and 67 (corresponding to deb9, 10, 11) from
http://ftp.debian.org/debian/pool/main/i/icu/ can be installed with
dpkg -i on my Debian 11 machine. 52 (deb8) too, probably, but it has
dependencies I didn't look into. 71 and 72 are newer than the -dev
version (what we link against), so I didn't try installing but the
patch as posted wouldn't let me open them: the idea here is to allow
only older stuff to be dlopen'd, so if a breaking API change comes
down the pipe we'll be able to deal with it. Not being a packaging
guy, I don't know how stupid it would be to build a package repo that
literally just exposes these via an index and that's all, or whether
it's better to rebuild the ICU versions from source against modern
C/C++ runtimes etc.

#87

Peter Eisentraut

peter.eisentraut@enterprisedb.com

about 3 years ago

In reply to: Thomas Munro (#86)

Re: Collation version tracking for macOS

On 02.11.22 00:57, Thomas Munro wrote:

3. Library availability. This is a problem for downstream
communities to solve. For example, the people who build Windows
installers might want to start bundling the ICU versions from their
earlier releases, the people involved with each Linux/BSD distro would
hopefully figure out a good way to publish the packages from older OS
releases in one repo, and the people running managed systems probably
do their own packaging anyway, they'll figure it out. I realise that
you are involved in packaging and I am not, so we probably have
different perspectives: I get to say "and here, magic happens!" :-)

I made a Homebrew repository for ICU versions 50 through 72:
https://github.com/petere/homebrew-icu

All of these packages build and pass their self-tests on my machine. So
from that experience, I think maintaining a repository of ICU versions,
and being able to install more than one for testing this feature, is
feasible.

Now I have started building PostgreSQL against these, to get some
baseline of what is supported and actually works. The results are a bit
mixed so far, more to come later.

The installation instructions currently say that the minimum required
version of ICU is 4.2. That was the one that shipped with RHEL 6. I
think we have de-supported RHEL 6 and could increase that. The version
in RHEL 7 is 50.

(My repository happens to start at 50 because the new versioning system
started at 49, but 49 doesn't appear to be tagged at the icu github site.)

Note: Recent versions of libxml2 link against icu. This isn't a
problem, thanks to the symbol versioning, but if you get libxml2 via
pkg-config, you might get LDFLAGS for a different icu version than the one you wanted.

#88

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Peter Eisentraut (#87)

Re: Collation version tracking for macOS

On Tue, Nov 8, 2022 at 1:22 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

I made a Homebrew repository for ICU versions 50 through 72:
https://github.com/petere/homebrew-icu

Nice!

All of these packages build and pass their self-tests on my machine. So
from that experience, I think maintaining a repository of ICU versions,
and being able to install more than one for testing this feature, is
feasible.

I wonder what the situation with CVEs is in older releases. I heard a
rumour that upstream might only patch current + previous, leaving it
up to distros to back-patch to whatever they need to support, but I
haven't tried to track down cold hard evidence of this or think about
what it means for this project...

#89

Peter Eisentraut

peter.eisentraut@enterprisedb.com

about 3 years ago

In reply to: Thomas Munro (#83)

1 attachment(s)

Re: Collation version tracking for macOS

On 22.10.22 03:22, Thomas Munro wrote:

I'd love to hear others' thoughts on how we can turn this into a
workable solution. Hopefully while staying simple...

I played with this patch a bit. It looks like a reasonable approach.

Attached is a small patch to get the dynamic libicu* lookup working with
the library naming on macOS.
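
For readers who have not hit the naming difference: ELF platforms put the major version after the suffix (libicui18n.so.67), while macOS puts it before (libicui18n.67.dylib). The sketch below shows only that difference. The function and enum names are hypothetical, and the ".so" and ".dylib" literals are assumptions standing in for DLSUFFIX.

```c
#include <stdio.h>

/* Hypothetical selector for the two naming conventions. */
typedef enum
{
	NAMING_ELF,
	NAMING_MACOS
} lib_naming;

/*
 * Build a file name such as "libicui18n.so.67" or "libicui18n.67.dylib".
 * An empty dir means "let the system linker search its default path".
 * Returns what snprintf returns.
 */
static int
icu_library_name(char *buf, size_t buflen, lib_naming naming,
				 const char *dir, const char *stem, int major_version)
{
	const char *sep = (dir[0] != '\0') ? "/" : "";

	if (naming == NAMING_MACOS)
		return snprintf(buf, buflen, "%s%s%s.%d.dylib",
						dir, sep, stem, major_version);
	return snprintf(buf, buflen, "%s%s%s.so.%d",
					dir, sep, stem, major_version);
}
```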

Instead of packing the ICU version into the locale field ('63:en'), I
would make it a separate field in pg_collation and a separate argument
in CREATE COLLATION.

At this point, perhaps it would be good to start building some tests to
demonstrate various upgrade scenarios and to ensure portability.

Attachments:

0001-fixup-WIP-Multi-version-ICU.patch (text/plain; charset=UTF-8)

From e236f5257bf0bf3e7b83b9d9b095d1d0e3fdc971 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Fri, 11 Nov 2022 15:44:44 +0100
Subject: [PATCH] fixup! WIP: Multi-version ICU.

---
 src/backend/utils/adt/pg_locale.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 666a79b907a4..3ffb9706ff99 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1600,6 +1600,10 @@ get_icu_library(int major_version)
 				 "%s%sicui18n%d." DLSUFFIX,
 				 icu_library_path,
 				 icu_library_path[0] ? "\\" : "",
+#elif defined(__darwin__)
+				 "%s%slibicui18n.%d" DLSUFFIX,
+				 icu_library_path,
+				 icu_library_path[0] ? "/" : "",
 #else
 				 "%s%slibicui18n" DLSUFFIX ".%d",
 				 icu_library_path,
@@ -1615,6 +1619,10 @@ get_icu_library(int major_version)
 				 "%s%sicuuc%d." DLSUFFIX,
 				 icu_library_path,
 				 icu_library_path[0] ? "\\" : "",
+#elif defined(__darwin__)
+				 "%s%slibicuuc.%d" DLSUFFIX,
+				 icu_library_path,
+				 icu_library_path[0] ? "/" : "",
 #else
 				 "%s%slibicuuc" DLSUFFIX ".%d",
 				 icu_library_path,
-- 
2.38.1

#90

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#83)

Re: Collation version tracking for macOS

I looked at v6.

* We'll need some clearer instructions on how to build/install extra
ICU versions that might not be provided by the distribution packaging.
For instance, I got a cryptic error until I used --enable-rpath, which
might not be obvious to all users.
* Can we have a better error when the library was built with
--disable-renaming? We can just search for the plain (no suffix) symbol.
* We should use dlerror() instead of %m to report dlopen() errors.
* It seems like the collation version is just there to issue WARNINGs
when a user is using the non-versioned locale syntax and the library
changes underneath them (or if there is collation version change within
a single ICU major version)?
* How are you testing this?
* In my tests (sort, hacked so abbreviate is always false), I see a
~3% regression for ICU+UTF8. That's fine with me. I assume it's due to
the indirect function call, but that's not obvious to me from the
profile. If it's a major problem we could have a special case of
varstrfastcmp_locale() that works on the compile-time ICU version.

I realize your patch is experimental, but when there is a better
consensus on the approach, we should consider adding declarative syntax
such as:

CREATE COLLATION (or LOCALE?) PROVIDER icu67
TYPE icu VERSION '67' AS '/path/to/icui18n.so.67';

It will offer more opportunities to catch errors early and offer better
error messages. It would also enable it to function if the library is
built with --disable-renaming (though we'd have to trust the user).

On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:

Problem 1: Suppose you're ready to start using (say) v72. I guess
you'd use the REFRESH command, which would open the main linked ICU's
collversion and stamp that into the catalogue, at which point new
sessions would start using that, and then you'd have to rebuild all
your indexes (with no help from PG to tell you how to find everything
that needs to be rebuilt, as belaboured in previous reverted work).
Aside from the possibility of getting the rebuilding job wrong (as
belaboured elsewhere), it's not great, because there is still a
transitional period where you can be using the wrong version for your
data. So this requires some careful planning and understanding from
the administrator.

How is this related to the search-by-collversion design? It seems like
it's hard no matter what.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#91

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#90)

1 attachment(s)

Re: Collation version tracking for macOS

Replying to Peter and Jeff in one email.

On Sat, Nov 12, 2022 at 3:57 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

On 22.10.22 03:22, Thomas Munro wrote:

I'd love to hear others' thoughts on how we can turn this into a
workable solution. Hopefully while staying simple...

I played with this patch a bit. It looks like a reasonable approach.

Great news.

Attached is a small patch to get the dynamic libicu* lookup working with
the library naming on macOS.

Thanks, squashed.

Instead of packing the ICU version into the locale field ('63:en'), I
would make it a separate field in pg_collation and a separate argument
in CREATE COLLATION.

I haven't tried this yet, as I focused on coming up with a way of testing in
this iteration. I can try this next. I'm imagining that we'd have
pg_collation.collicuversion and pg_database.daticuversion, and they'd default
to 0 for "use the GUC", and perhaps you'd even be able to ALTER them. Perhaps
we wouldn't even need the GUC then... 0 could mean "the linked version", and
if you don't like it, you ALTER it. Thinking about this.
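
A minimal sketch of the fallback chain being imagined here, assuming 0 in the per-object field means "use the GUC" and 0 in the GUC means "the linked version". The function name and parameters are hypothetical; nothing like this exists in the posted patch.

```c
/*
 * Resolve which ICU major version to use for one collation or database.
 * object_version: hypothetical collicuversion/daticuversion (0 = unset)
 * guc_default:    default_icu_library_version (0 = linked version)
 * linked_version: the major version PostgreSQL was linked against
 */
static int
effective_icu_version(int object_version, int guc_default, int linked_version)
{
	if (object_version != 0)
		return object_version;
	if (guc_default != 0)
		return guc_default;
	return linked_version;
}
```

If the GUC were dropped as suggested, the middle test would simply disappear and 0 would map straight to the linked version.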

At this point, perhaps it would be good to start building some tests to
demonstrate various upgrade scenarios and to ensure portability.

OK, here's what I came up with. You enable it in PG_TEST_EXTRA, and
tell it about an alternative ICU version you have in the standard library
search path that is not the same as the main/linked one:

$ meson configure -DPG_TEST_EXTRA="icu=63"
$ meson test icu/020_multiversion

Another change from your feedback: you mentioned that RHEL7 shipped with ICU
50, so I removed my suggestion of dropping some extra code we carry for
versions before 54 and set the minimum acceptable version to 50. It probably
works further back than that, but that's a decent range, I think.

On Tue, Nov 15, 2022 at 1:55 PM Jeff Davis <pgsql@j-davis.com> wrote:

I looked at v6.

Thanks for jumping in and testing!

* We'll need some clearer instructions on how to build/install extra
ICU versions that might not be provided by the distribution packaging.
For instance, I got a cryptic error until I used --enable-rpath, which
might not be obvious to all users.

Suggestions welcome. No docs at all yet...

* Can we have a better error when the library was built with
--disable-renaming? We can just search for the plain (no suffix) symbol.

I threw out that symbol probing logic, and wrote something simpler that should
now also work with --disable-renaming (though not tested). Now it does a
cross-check with the library's self-reported major version, just to make
sure there wasn't a badly named library file, which may be more likely
with --disable-renaming.
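
Roughly, the checks described in this message amount to two tests: the requested major version must lie between the minimum (50) and the linked version, and the library that was actually opened must report the same major version. The sketch below assumes the self-reported version arrives as a four-byte array with the major number first, as ICU's UVersionInfo does; the helper names and the MIN_ICU_MAJOR macro are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimum acceptable major version, per the discussion above. */
#define MIN_ICU_MAJOR 50

/* Only versions from the minimum up to the linked one may be opened. */
static bool
icu_major_in_range(int requested, int linked)
{
	return requested >= MIN_ICU_MAJOR && requested <= linked;
}

/*
 * Cross-check against a badly named library file: element 0 of the
 * self-reported version is the major version.
 */
static bool
icu_reported_major_matches(const uint8_t reported[4], int requested)
{
	return reported[0] == requested;
}
```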

* We should use dlerror() instead of %m to report dlopen() errors.

Fixed.

* It seems like the collation version is just there to issue WARNINGs
when a user is using the non-versioned locale syntax and the library
changes underneath them (or if there is collation version change within
a single ICU major version)?

Correct.

I have now updated the warning messages you get when they don't match, to
provide a hint about what to do about it. I am sure they need some more
word-smithing, though.

* How are you testing this?

Ad hoc noodling before now, but see attached.

I realize your patch is experimental, but when there is a better
consensus on the approach, we should consider adding declarative syntax
such as:

CREATE COLLATION (or LOCALE?) PROVIDER icu67
TYPE icu VERSION '67' AS '/path/to/icui18n.so.67';

It will offer more opportunities to catch errors early and offer better
error messages. It would also enable it to function if the library is
built with --disable-renaming (though we'd have to trust the user).

Earlier in this and other threads, we wondered if each ICU major version should
be a separate provider, which is what you're showing there, or should be an
independent property of an individual COLLATION, which is what v6 did with
'63:en' and what Peter suggested I make more formal with CREATE COLLATION foo
(..., ICU_VERSION=63). I actually started out thinking we'd have multiple
providers, but I couldn't really think of any advantage, and I think it makes
some upgrade scenarios more painful. Can you elaborate on why you'd want
that model?
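
For comparison, the v6-style '63:en' encoding could be unpacked along the following lines. This is a sketch under assumptions, not the patch's parser: the function name is hypothetical, and the upper bound of 1000 simply mirrors the GUC's maximum.

```c
#include <stdlib.h>

/*
 * Split "63:en" into major version 63 and locale "en".  A string with no
 * valid "<digits>:" prefix is returned unchanged with *major_version = 0,
 * meaning "use the default library version".
 */
static const char *
split_versioned_locale(const char *spec, int *major_version)
{
	char	   *end;
	long		v;

	*major_version = 0;
	if (spec[0] < '0' || spec[0] > '9')
		return spec;
	v = strtol(spec, &end, 10);
	if (*end != ':' || v <= 0 || v > 1000)
		return spec;
	*major_version = (int) v;
	return end + 1;
}
```

A separate ICU_VERSION argument would make this parsing unnecessary, which is part of the appeal of keeping the version out of the locale string.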

On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:

Problem 1: Suppose you're ready to start using (say) v72. I guess
you'd use the REFRESH command, which would open the main linked ICU's
collversion and stamp that into the catalogue, at which point new
sessions would start using that, and then you'd have to rebuild all
your indexes (with no help from PG to tell you how to find everything
that needs to be rebuilt, as belaboured in previous reverted work).
Aside from the possibility of getting the rebuilding job wrong (as
belaboured elsewhere), it's not great, because there is still a
transitional period where you can be using the wrong version for your
data. So this requires some careful planning and understanding from
the administrator.

How is this related to the search-by-collversion design? It seems like
it's hard no matter what.

Yeah. I just don't like the way it *appears* to be doing something clever, but
it doesn't solve any fundamental problem at all because the collversion
information is under human control and so it's really doing something stupid.
Hence desire to build something that at least admits that it's primitive and
just gives you some controls, in a first version. We could always reconsider
that in later work though, maybe even an optional policy or something?

Attachments:

v7-0001-WIP-Multi-version-ICU.patch (text/x-patch; charset=US-ASCII)

From 51f0e2eaaf8e941033ad4ba7e412fc900636962d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v7] WIP: Multi-version ICU.

Add a layer of indirection when accessing ICU, so that multiple major
versions of the library can be used at once.  Versions other than the
one that PostgreSQL was linked against are opened with dlopen(), but we
refuse to open versions higher than the one we were compiled against.
The ABI might change in future releases so that wouldn't be safe.

By default, the system linker's default search path is used to find
libraries, but icu_library_path may be used to specify an absolute path
to look in.  ICU libraries are expected to have been built without ICU's
--disable-renaming option.  That is, major versions must use distinct
symbol names.

This arrangement means that at least one major version of ICU is always
available -- the one that PostgreSQL was linked against.  It should be
simple on most software distributions to install extra versions using a
package manager, or to build extra libraries as required, to access
older ICU releases.  For example, on Debian bullseye the packages are
named libicu63, libicu67, libicu71.

In this version of the patch, '63:en' used as a database default locale
or COLLATION object requests ICU library 63, and 'en' requests the
library version selected by the GUC default_icu_library_version,
defaulting to the version that the executable is linked against.

XXX Many other designs possible, to discuss!

Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CA%2BhUKGL4VZRpP3CkjYQkv4RQ6pRYkPkSNgKSxFBwciECQ0mEuQ%40mail.gmail.com
---
 src/backend/access/hash/hashfunc.c            |  16 +-
 src/backend/commands/collationcmds.c          |  20 +
 src/backend/utils/adt/formatting.c            |  53 +-
 src/backend/utils/adt/pg_locale.c             | 451 +++++++++++++++++-
 src/backend/utils/adt/varchar.c               |  16 +-
 src/backend/utils/adt/varlena.c               |  56 +--
 src/backend/utils/init/postinit.c             |  48 +-
 src/backend/utils/misc/guc_tables.c           |  28 ++
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/catalog/pg_proc.dat               |   3 +
 src/include/utils/pg_locale.h                 |  75 +++
 src/test/icu/meson.build                      |   1 +
 src/test/icu/t/020_multiversion.pl            | 203 ++++++++
 src/tools/pgindent/typedefs.list              |   3 +
 14 files changed, 888 insertions(+), 90 deletions(-)
 create mode 100644 src/test/icu/t/020_multiversion.pl

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..0a61538efd 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,11 @@ hashtext(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -355,11 +355,11 @@ hashtextextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 81e54e0ce6..4fb0c77f38 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -853,6 +853,26 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 					CreateComments(collid, CollationRelationId, 0,
 								   icucomment);
 			}
+
+			/* Also create an object pinned to an ICU major version. */
+			collid = CollationCreate(psprintf("%s-x-icu-%d", langtag, U_ICU_VERSION_MAJOR_NUM),
+									 nspid, GetUserId(),
+									 COLLPROVIDER_ICU, true, -1,
+									 NULL, NULL,
+									 psprintf("%d:%s", U_ICU_VERSION_MAJOR_NUM, iculocstr),
+									 get_collation_actual_version(COLLPROVIDER_ICU, iculocstr),
+									 true, true);
+			if (OidIsValid(collid))
+			{
+				ncreated++;
+
+				CommandCounterIncrement();
+
+				icucomment = get_icu_locale_comment(name);
+				if (icucomment)
+					CreateComments(collid, CollationRelationId, 0,
+								   icucomment);
+			}
 		}
 	}
 #endif							/* USE_ICU */
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 26f498b5df..0c3c7724d7 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1599,6 +1599,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1623,18 +1628,41 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						PG_ICU_LIB(mylocale)->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->info.icu.locale, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->info.icu.locale, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						PG_ICU_LIB(mylocale)->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1702,7 +1730,8 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToLower,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1824,7 +1853,8 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToUpper,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1947,8 +1977,9 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
+			len_conv = icu_convert_case_bi(PG_ICU_LIB(mylocale)->strToTitle,
+										   mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..3cc51a54a8 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -58,6 +58,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_control.h"
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/formatting.h"
 #include "utils/guc_hooks.h"
@@ -69,6 +70,7 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -79,14 +81,35 @@
 #include <shlwapi.h>
 #endif
 
+#include <dlfcn.h>
+
 #define		MAX_L10N_DATA		80
 
+#ifdef USE_ICU
+
+/*
+ * We don't want to call into dlopen'd ICU libraries that are newer than the
+ * one we were compiled and linked against, just in case there is an
+ * incompatible API change.
+ */
+#define PG_MAX_ICU_MAJOR_VERSION U_ICU_VERSION_MAJOR_NUM
+
+/*
+ * The oldest ICU release we're likely to encounter, and that has all the
+ * functions required.
+ */
+#define PG_MIN_ICU_MAJOR_VERSION 50
+
+#endif
+
 
 /* GUC settings */
 char	   *locale_messages;
 char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
+char	   *icu_library_path;
+int			default_icu_library_version;
 
 /*
  * lc_time localization cache.
@@ -1398,29 +1421,348 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
+#ifdef USE_ICU
+
 struct pg_locale_struct default_locale;
 
+/* Linked list of ICU libraries we have loaded. */
+static pg_icu_library *icu_library_list = NULL;
+
+/*
+ * Free an ICU library.  pg_icu_library objects that are successfully
+ * constructed stick around for the lifetime of the backend, but this is used
+ * to clean up if initialization fails.
+ */
+static void
+free_icu_library(pg_icu_library *lib)
+{
+	if (lib->libicui18n_handle)
+		dlclose(lib->libicui18n_handle);
+	if (lib->libicuuc_handle)
+		dlclose(lib->libicuuc_handle);
+	pfree(lib);
+}
+
+static void *
+get_icu_function(void *handle, const char *function, int version)
+{
+	char		function_with_version[80];
+	void	   *result;
+
+	/*
+	 * Try to look it up using the symbols with major versions, but if that
+	 * doesn't work, also try the unversioned name in case the library was
+	 * configured with --disable-renaming.
+	 */
+	snprintf(function_with_version, sizeof(function_with_version), "%s_%d",
+			 function, version);
+	result = dlsym(handle, function_with_version);
+
+	return result ? result : dlsym(handle, function);
+}
+
+/*
+ * Helper to load a library.
+ */
+static void *
+load_icu_library(pg_icu_library *lib, const char *name)
+{
+	void	   *handle;
+
+	handle = dlopen(name, RTLD_NOW | RTLD_GLOBAL);
+	if (handle == NULL)
+	{
+		char		message[80];
+
+		strlcpy(message, dlerror(), sizeof(message));
+		free_icu_library(lib);
+		ereport(ERROR,
+				(errmsg("could not load library \"%s\": %s", name, message)));
+	}
+
+	return handle;
+}
+
+/*
+ * Given an ICU major version number, return the object we need to access it,
+ * or fail while trying to load it.
+ */
+static pg_icu_library *
+get_icu_library(int major_version)
+{
+	UVersionInfo versioninfo;
+	char		versioninfostring[U_MAX_VERSION_STRING_LENGTH];
+	pg_icu_library *lib;
+
+	/* XXX Move range check into guc_table.c? */
+	if (major_version < PG_MIN_ICU_MAJOR_VERSION ||
+		major_version > PG_MAX_ICU_MAJOR_VERSION)
+		elog(ERROR,
+			 "ICU version must be between %d and %d",
+			 PG_MIN_ICU_MAJOR_VERSION,
+			 PG_MAX_ICU_MAJOR_VERSION);
+
+	/* Try to find it in our list of existing libraries. */
+	for (lib = icu_library_list; lib; lib = lib->next)
+		if (lib->major_version == major_version)
+			return lib;
+
+	/* Make a new entry. */
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(*lib));
+	if (major_version == U_ICU_VERSION_MAJOR_NUM)
+	{
+		/*
+		 * This is the version we were compiled and linked against.  Simply
+		 * assign the function pointers.
+		 *
+		 * These assignments will fail to compile if an incompatible API
+		 * change is made to some future version of ICU, at which point we
+		 * might need to consider special treatment for different major
+		 * version ranges, with intermediate trampoline functions.
+		 */
+		lib->major_version = major_version;
+		lib->getLibraryVersion = u_getVersion;
+		lib->open = ucol_open;
+		lib->close = ucol_close;
+		lib->getVersion = ucol_getVersion;
+		lib->versionToString = u_versionToString;
+		lib->strcoll = ucol_strcoll;
+		lib->strcollUTF8 = ucol_strcollUTF8;
+		lib->getSortKey = ucol_getSortKey;
+		lib->nextSortKeyPart = ucol_nextSortKeyPart;
+		lib->setUTF8 = uiter_setUTF8;
+		lib->errorName = u_errorName;
+		lib->strToUpper = u_strToUpper;
+		lib->strToLower = u_strToLower;
+		lib->strToTitle = u_strToTitle;
+
+		/*
+		 * Also assert the size of a couple of types used as output buffers,
+		 * as a canary to tell us to add extra padding in the (unlikely) event
+		 * that a later release makes these values smaller.
+		 */
+		StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+						 "u_versionToString output buffer size changed incompatibly");
+		StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+						 "ucol_getVersion output buffer size changed incompatibly");
+	}
+	else
+	{
+		/* This is an older version, so we'll need to use dlopen(). */
+		char		libicui18n_name[MAXPGPATH];
+		char		libicuuc_name[MAXPGPATH];
+
+		/*
+		 * We don't like to open versions newer than what we're linked
+		 * against, to reduce the risk of an API change biting us.
+		 */
+		if (major_version > U_ICU_VERSION_MAJOR_NUM)
+			elog(ERROR, "ICU major version %d higher than linked version %d, refusing to open",
+				 major_version, U_ICU_VERSION_MAJOR_NUM);
+
+		lib->major_version = major_version;
+
+		/*
+		 * See
+		 * https://unicode-org.github.io/icu/userguide/icu4c/packaging.html#icu-versions
+		 * for conventions on library naming on POSIX and Windows systems.
+		 */
+
+		/* Load the collation library. */
+		snprintf(libicui18n_name,
+				 sizeof(libicui18n_name),
+#ifdef WIN32
+				 "%s%sicui18n%d" DLSUFFIX,
+				 icu_library_path,
+				 icu_library_path[0] ? "\\" : "",
+#elif defined(__darwin__)
+				 "%s%slibicui18n.%d" DLSUFFIX,
+				 icu_library_path,
+				 icu_library_path[0] ? "/" : "",
+#else
+				 "%s%slibicui18n" DLSUFFIX ".%d",
+				 icu_library_path,
+				 icu_library_path[0] ? "/" : "",
+#endif
+				 major_version);
+		lib->libicui18n_handle = load_icu_library(lib, libicui18n_name);
+
+		/* Load the ctype library. */
+		snprintf(libicuuc_name,
+				 sizeof(libicuuc_name),
+#ifdef WIN32
+				 "%s%sicuuc%d" DLSUFFIX,
+				 icu_library_path,
+				 icu_library_path[0] ? "\\" : "",
+#elif defined(__darwin__)
+				 "%s%slibicuuc.%d" DLSUFFIX,
+				 icu_library_path,
+				 icu_library_path[0] ? "/" : "",
+#else
+				 "%s%slibicuuc" DLSUFFIX ".%d",
+				 icu_library_path,
+				 icu_library_path[0] ? "/" : "",
+#endif
+				 major_version);
+		lib->libicuuc_handle = load_icu_library(lib, libicuuc_name);
+
+		/* Look up all the functions we need. */
+		lib->getLibraryVersion = get_icu_function(lib->libicui18n_handle,
+												  "u_getVersion",
+												  major_version);
+		lib->open = get_icu_function(lib->libicui18n_handle,
+									 "ucol_open",
+									 major_version);
+		lib->close = get_icu_function(lib->libicui18n_handle,
+									  "ucol_close",
+									  major_version);
+		lib->getVersion = get_icu_function(lib->libicui18n_handle,
+										   "ucol_getVersion",
+										   major_version);
+		lib->versionToString = get_icu_function(lib->libicui18n_handle,
+												"u_versionToString",
+												major_version);
+		lib->strcoll = get_icu_function(lib->libicui18n_handle,
+										"ucol_strcoll",
+										major_version);
+		lib->strcollUTF8 = get_icu_function(lib->libicui18n_handle,
+											"ucol_strcollUTF8",
+											major_version);
+		lib->getSortKey = get_icu_function(lib->libicui18n_handle,
+										   "ucol_getSortKey",
+										   major_version);
+		lib->nextSortKeyPart = get_icu_function(lib->libicui18n_handle,
+												"ucol_nextSortKeyPart",
+												major_version);
+		lib->setUTF8 = get_icu_function(lib->libicui18n_handle,
+										"uiter_setUTF8",
+										major_version);
+		lib->errorName = get_icu_function(lib->libicui18n_handle,
+										  "u_errorName",
+										  major_version);
+		lib->strToUpper = get_icu_function(lib->libicuuc_handle,
+										   "u_strToUpper",
+										   major_version);
+		lib->strToLower = get_icu_function(lib->libicuuc_handle,
+										   "u_strToLower",
+										   major_version);
+		lib->strToTitle = get_icu_function(lib->libicuuc_handle,
+										   "u_strToTitle",
+										   major_version);
+		if (!lib->getLibraryVersion ||
+			!lib->open ||
+			!lib->close ||
+			!lib->getVersion ||
+			!lib->versionToString ||
+			!lib->strcoll ||
+			!lib->strcollUTF8 ||
+			!lib->getSortKey ||
+			!lib->nextSortKeyPart ||
+			!lib->setUTF8 ||
+			!lib->errorName ||
+			!lib->strToUpper ||
+			!lib->strToLower ||
+			!lib->strToTitle)
+		{
+			free_icu_library(lib);
+			ereport(ERROR,
+					(errmsg("could not find expected symbols in libraries \"%s\" and \"%s\"",
+							libicui18n_name, libicuuc_name)));
+		}
+	}
+
+	/*
+	 * Check that the library's own u_getVersion() function reports the version
+	 * that we expected.  By using atoi() we take only the major part.
+	 */
+	lib->getLibraryVersion(versioninfo);
+	lib->versionToString(versioninfo, versioninfostring);
+	if (atoi(versioninfostring) != major_version)
+	{
+		free_icu_library(lib);
+		ereport(ERROR,
+				(errmsg("opened ICU library with major version %d but it reported its own version as %s",
+						major_version, versioninfostring)));
+	}
+
+	lib->next = icu_library_list;
+	icu_library_list = lib;
+
+	return lib;
+}
+
+/*
+ * Look up the library to use for a given collcollate string.
+ */
+static pg_icu_library *
+get_icu_library_for_collation(const char *collcollate, const char **rest)
+{
+	int			major_version;
+	char	   *separator;
+	char	   *after_prefix;
+
+	separator = strchr(collcollate, ':');
+
+	/*
+	 * If it's a traditional value without a prefix, use the default ICU
+	 * library.  That's the one we were linked against, or another one if
+	 * default_icu_library_version has been set.
+	 */
+	if (separator == NULL)
+	{
+		*rest = collcollate;
+
+		if (default_icu_library_version > 0)
+			major_version = default_icu_library_version;
+		else
+			major_version = U_ICU_VERSION_MAJOR_NUM;
+		return get_icu_library(major_version);
+	}
+
+	/* If it has a prefix, interpret it as an ICU major version. */
+	major_version = strtol(collcollate, &after_prefix, 10);
+	if (after_prefix != separator)
+		elog(ERROR,
+			 "could not parse ICU major library version: \"%s\"",
+			 collcollate);
+	if (major_version < PG_MIN_ICU_MAJOR_VERSION ||
+		major_version > PG_MAX_ICU_MAJOR_VERSION)
+		elog(ERROR,
+			 "ICU major library version out of supported range: \"%s\"",
+			 collcollate);
+
+	/* The part after the separator will be passed to the library. */
+	*rest = separator + 1;
+
+	return get_icu_library(major_version);
+}
+
+#endif
+
 void
 make_icu_collator(const char *iculocstr,
 				  struct pg_locale_struct *resultp)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_icu_library_for_collation(iculocstr, &iculocstr);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
+	collator = lib->open(iculocstr, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+						iculocstr, lib->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	if (lib->major_version < 54)
 		icu_set_collation_attributes(collator, iculocstr);
 
 	/* We will leak this string if the caller errors later :-( */
 	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
 	resultp->info.icu.ucol = collator;
+	resultp->info.icu.lib = lib;
 #else							/* not USE_ICU */
 	/* could get here if a collation was created by a build with ICU */
 	ereport(ERROR,
@@ -1593,14 +1935,15 @@ pg_newlocale_from_collation(Oid collid)
 		{
 			char	   *actual_versionstr;
 			char	   *collversionstr;
+			char	   *locale;
 
 			collversionstr = TextDatumGetCString(datum);
 
 			datum = SysCacheGetAttr(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
 			Assert(!isnull);
+			locale = TextDatumGetCString(datum);
 
-			actual_versionstr = get_collation_actual_version(collform->collprovider,
-															 TextDatumGetCString(datum));
+			actual_versionstr = get_collation_actual_version(collform->collprovider, locale);
 			if (!actual_versionstr)
 			{
 				/*
@@ -1614,17 +1957,44 @@ pg_newlocale_from_collation(Oid collid)
 			}
 
 			if (strcmp(actual_versionstr, collversionstr) != 0)
-				ereport(WARNING,
-						(errmsg("collation \"%s\" has version mismatch",
-								NameStr(collform->collname)),
-						 errdetail("The collation in the database was created using version %s, "
-								   "but the operating system provides version %s.",
-								   collversionstr, actual_versionstr),
-						 errhint("Rebuild all objects affected by this collation and run "
-								 "ALTER COLLATION %s REFRESH VERSION, "
-								 "or build PostgreSQL with the right library version.",
-								 quote_qualified_identifier(get_namespace_name(collform->collnamespace),
-															NameStr(collform->collname)))));
+			{
+				if (collform->collprovider == COLLPROVIDER_ICU)
+				{
+					ereport(WARNING,
+							(errmsg("collation \"%s\" has version mismatch",
+									NameStr(collform->collname)),
+							 errdetail("The collation in the database was created using version %s, "
+									   "but the ICU library provides version %s.",
+									   collversionstr, actual_versionstr),
+							 strchr(locale, ':') != NULL ?
+							 errhint("Rebuild all objects affected by this collation and run "
+									 "ALTER COLLATION %s REFRESH VERSION, "
+									 "or build PostgreSQL with the right library version.",
+									 quote_qualified_identifier(get_namespace_name(collform->collnamespace),
+																NameStr(collform->collname))) :
+							 errhint("Install another version of ICU and select it using "
+									 "default_icu_library_version, "
+									 "or rebuild all objects affected by this collation and run "
+									 "ALTER COLLATION %s REFRESH VERSION, "
+									 "or build PostgreSQL with the right library version.",
+									 quote_qualified_identifier(get_namespace_name(collform->collnamespace),
+																NameStr(collform->collname)))));
+				}
+				else
+				{
+					ereport(WARNING,
+							(errmsg("collation \"%s\" has version mismatch",
+									NameStr(collform->collname)),
+							 errdetail("The collation in the database was created using version %s, "
+									   "but the operating system provides version %s.",
+									   collversionstr, actual_versionstr),
+							 errhint("Rebuild all objects affected by this collation and run "
+									 "ALTER COLLATION %s REFRESH VERSION, "
+									 "or build PostgreSQL with the right library version.",
+									 quote_qualified_identifier(get_namespace_name(collform->collnamespace),
+																NameStr(collform->collname)))));
+				}
+			}
 		}
 
 		ReleaseSysCache(tp);
@@ -1651,21 +2021,23 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 #ifdef USE_ICU
 	if (collprovider == COLLPROVIDER_ICU)
 	{
+		pg_icu_library *lib;
 		UCollator  *collator;
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
+		lib = get_icu_library_for_collation(collcollate, &collcollate);
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = lib->open(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, lib->errorName(status))));
+		lib->getVersion(collator, versioninfo);
+		lib->close(collator);
 
-		u_versionToString(versioninfo, buf);
+		lib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -1733,6 +2105,33 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 
 
 #ifdef USE_ICU
+
+/*
+ * Given a major version number, look up that library and ask it for the
+ * complete version string.
+ */
+Datum
+pg_icu_library_version(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ICU
+	int			major_version;
+	pg_icu_library *lib;
+	UVersionInfo versioninfo;
+	char		buf[U_MAX_VERSION_STRING_LENGTH];
+
+	major_version = PG_GETARG_INT32(0);
+	if (major_version <= 0)
+		major_version = U_ICU_VERSION_MAJOR_NUM;
+
+	lib = get_icu_library(major_version);
+	lib->getLibraryVersion(versioninfo);
+	lib->versionToString(versioninfo, buf);
+	PG_RETURN_TEXT_P(cstring_to_text(buf));
+#else
+	PG_RETURN_NULL();
+#endif
+}
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -1954,19 +2353,21 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_icu_library_for_collation(icu_locale, &icu_locale);
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = lib->open(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, lib->errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
+	if (lib->major_version < 54)
 		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+	lib->close(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 68e2e6f7a7..e0c86870e0 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1026,11 +1026,11 @@ hashbpchar(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, keydata, keylen);
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -1087,11 +1087,11 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c5e7ee7ca2..cf891a5654 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1667,13 +1667,14 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					UErrorCode	status;
 
 					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
+					result = PG_ICU_LIB(mylocale)->strcollUTF8(PG_ICU_COL(mylocale),
+															   arg1, len1,
+															   arg2, len2,
+															   &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
+								(errmsg("collation failed: %s",
+										PG_ICU_LIB(mylocale)->errorName(status))));
 				}
 				else
 #endif
@@ -1686,9 +1687,9 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
 					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
 
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
+					result = PG_ICU_LIB(mylocale)->strcoll(PG_ICU_COL(mylocale),
+														   uchar1, ulen1,
+														   uchar2, ulen2);
 
 					pfree(uchar1);
 					pfree(uchar2);
@@ -2388,13 +2389,14 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				UErrorCode	status;
 
 				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
+				result = PG_ICU_LIB(sss->locale)->strcollUTF8(PG_ICU_COL(sss->locale),
+															  a1p, len1,
+															  a2p, len2,
+															  &status);
 				if (U_FAILURE(status))
 					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
+							(errmsg("collation failed: %s",
+									PG_ICU_LIB(sss->locale)->errorName(status))));
 			}
 			else
 #endif
@@ -2407,9 +2409,9 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
 				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
 
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
+				result = PG_ICU_LIB(sss->locale)->strcoll(PG_ICU_COL(sss->locale),
+														  uchar1, ulen1,
+														  uchar2, ulen2);
 
 				pfree(uchar1);
 				pfree(uchar2);
@@ -2569,24 +2571,24 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 					uint32_t	state[2];
 					UErrorCode	status;
 
-					uiter_setUTF8(&iter, sss->buf1, len);
+					PG_ICU_LIB(sss->locale)->setUTF8(&iter, sss->buf1, len);
 					state[0] = state[1] = 0;	/* won't need that again */
 					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
+					bsize = PG_ICU_LIB(sss->locale)->nextSortKeyPart(PG_ICU_COL(sss->locale),
+																	 &iter,
+																	 state,
+																	 (uint8_t *) sss->buf2,
+																	 Min(sizeof(Datum), sss->buflen2),
+																	 &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
 								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
+										PG_ICU_LIB(sss->locale)->errorName(status))));
 				}
 				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+					bsize = PG_ICU_LIB(sss->locale)->getSortKey(PG_ICU_COL(sss->locale),
+																uchar, ulen,
+																(uint8_t *) sss->buf2, sss->buflen2);
 			}
 			else
 #endif
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a990c833c5..d18aa7a3df 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -449,26 +449,52 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	{
 		char	   *actual_versionstr;
 		char	   *collversionstr;
+		char	   *locale;
 
 		collversionstr = TextDatumGetCString(datum);
+		locale = dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate;
 
-		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
+		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, locale);
 		if (!actual_versionstr)
 			/* should not happen */
 			elog(WARNING,
 				 "database \"%s\" has no actual collation version, but a version was recorded",
 				 name);
 		else if (strcmp(actual_versionstr, collversionstr) != 0)
-			ereport(WARNING,
-					(errmsg("database \"%s\" has a collation version mismatch",
-							name),
-					 errdetail("The database was created using collation version %s, "
-							   "but the operating system provides version %s.",
-							   collversionstr, actual_versionstr),
-					 errhint("Rebuild all objects in this database that use the default collation and run "
-							 "ALTER DATABASE %s REFRESH COLLATION VERSION, "
-							 "or build PostgreSQL with the right library version.",
-							 quote_identifier(name))));
+		{
+			if (dbform->datlocprovider == COLLPROVIDER_ICU)
+			{
+				ereport(WARNING,
+						(errmsg("database \"%s\" has a collation version mismatch",
+								name),
+						 errdetail("The database was created using collation version %s, "
+								   "but the ICU library provides version %s.",
+								   collversionstr, actual_versionstr),
+						 strchr(locale, ':') != NULL ?
+						 errhint("Rebuild all objects in this database that use the default collation and run "
+								 "ALTER DATABASE %s REFRESH COLLATION VERSION, "
+								 "or build PostgreSQL with the right library version.",
+								 quote_identifier(name)) :
+						 errhint("Install another version of ICU and select it using default_icu_library_version, or "
+								 "rebuild all objects in this database that use the default collation and run "
+								 "ALTER DATABASE %s REFRESH COLLATION VERSION, "
+								 "or build PostgreSQL with the right library version.",
+								 quote_identifier(name))));
+			}
+			else
+			{
+				ereport(WARNING,
+						(errmsg("database \"%s\" has a collation version mismatch",
+								name),
+						 errdetail("The database was created using collation version %s, "
+								   "but the operating system provides version %s.",
+								   collversionstr, actual_versionstr),
+						 errhint("Rebuild all objects in this database that use the default collation and run "
+								 "ALTER DATABASE %s REFRESH COLLATION VERSION, "
+								 "or build PostgreSQL with the right library version.",
+								 quote_identifier(name))));
+			}
+		}
 	}
 
 	/* Make the locale settings visible as GUC variables, too */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 836b49484a..25e905ce8b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2937,6 +2937,20 @@ struct config_int ConfigureNamesInt[] =
 		check_max_worker_processes, NULL, NULL
 	},
 
+	{
+		{"default_icu_library_version",
+			PGC_SUSET,
+			COMPAT_OPTIONS_PREVIOUS,
+			gettext_noop("Sets the default major version of the ICU library to use for collations that do not specify one."),
+			NULL
+		},
+		&default_icu_library_version,
+		0,
+		0,
+		1000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"max_logical_replication_workers",
 			PGC_POSTMASTER,
@@ -3920,6 +3934,20 @@ struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"icu_library_path", PGC_SUSET, COMPAT_OPTIONS_PREVIOUS,
+			gettext_noop("Sets the path for dynamically loadable ICU libraries."),
+			gettext_noop("If versions of ICU other than the one that "
+						 "PostgreSQL is linked against are needed, they will "
+						 "be opened from this directory.  If empty, the "
+						 "system linker search path will be used."),
+			GUC_SUPERUSER_ONLY
+		},
+		&icu_library_path,
+		"",
+		NULL, NULL, NULL
+	},
+
 	{
 		{"krb_server_keyfile", PGC_SIGHUP, CONN_AUTH_AUTH,
 			gettext_noop("Sets the location of the Kerberos server key file."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c351..2713c92124 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -727,6 +727,11 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#default_icu_library_version = 0	# default major version of ICU library
+					# (0 for the linked version)
+#icu_library_path = ''			# path for dynamically loaded ICU
+					# libraries
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9dbe9ec801..a76eb6c94b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11707,6 +11707,9 @@
   proname => 'pg_database_collation_actual_version', procost => '100',
   provolatile => 'v', prorettype => 'text', proargtypes => 'oid',
   prosrc => 'pg_database_collation_actual_version' },
+{ oid => '8888', descr => 'get ICU library version string',
+  proname => 'pg_icu_library_version', provolatile => 'v', prorettype => 'text',
+  proargtypes => 'int4', prosrc => 'pg_icu_library_version' },
 
 # system management/monitoring related functions
 { oid => '3353', descr => 'list files in the log directory',
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index a875942123..c52fe0c7df 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -17,6 +17,7 @@
 #endif
 #ifdef USE_ICU
 #include <unicode/ucol.h>
+#include <unicode/ubrk.h>
 #endif
 
 #ifdef USE_ICU
@@ -40,6 +41,8 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT char *icu_library_path;
+extern PGDLLIMPORT int default_icu_library_version;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -63,6 +66,72 @@ extern struct lconv *PGLC_localeconv(void);
 
 extern void cache_locale_time(void);
 
+#ifdef USE_ICU
+
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	void	   *libicui18n_handle;
+	void	   *libicuuc_handle;
+	void		(*getLibraryVersion) (UVersionInfo info);
+	UCollator  *(*open) (const char *loc, UErrorCode *status);
+	void		(*close) (UCollator *coll);
+	void		(*getVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+				UCollationResult(*strcoll) (const UCollator *coll,
+											const UChar *source,
+											int32_t sourceLength,
+											const UChar *target,
+											int32_t targetLength);
+				UCollationResult(*strcollUTF8) (const UCollator *coll,
+												const char *source,
+												int32_t sourceLength,
+												const char *target,
+												int32_t targetLength,
+												UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	struct pg_icu_library *next;
+} pg_icu_library;
+
+#endif
 
 /*
  * We define our own wrapper around locale_t so we can keep the same
@@ -84,12 +153,18 @@ struct pg_locale_struct
 		{
 			const char *locale;
 			UCollator  *ucol;
+			pg_icu_library *lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
 	}			info;
 };
 
+#ifdef USE_ICU
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+#define PG_ICU_COL(x) ((x)->info.icu.ucol)
+#endif
+
 typedef struct pg_locale_struct *pg_locale_t;
 
 extern PGDLLIMPORT struct pg_locale_struct default_locale;
diff --git a/src/test/icu/meson.build b/src/test/icu/meson.build
index 5a4f53f37f..ac2672190e 100644
--- a/src/test/icu/meson.build
+++ b/src/test/icu/meson.build
@@ -5,6 +5,7 @@ tests += {
   'tap': {
     'tests': [
       't/010_database.pl',
+      't/020_multiversion.pl',
     ],
     'env': {'with_icu': icu.found() ? 'yes' : 'no'},
   },
diff --git a/src/test/icu/t/020_multiversion.pl b/src/test/icu/t/020_multiversion.pl
new file mode 100644
index 0000000000..52408deb59
--- /dev/null
+++ b/src/test/icu/t/020_multiversion.pl
@@ -0,0 +1,203 @@
+# Copyright (c) 2022, PostgreSQL Global Development Group
+
+# This test requires a second major version of ICU installed in the usual
+# system library search path.  That is, not the one PostgreSQL was linked
+# against.  It also assumes that ucol_getVersion() for locale "en" will change
+# between the two library versions.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{with_icu} ne 'yes')
+{
+	plan skip_all => 'ICU not supported by this build';
+}
+
+if (!($ENV{PG_TEST_EXTRA} =~ /\bicu=([0-9]+)\b/))
+{
+	plan skip_all => 'PG_TEST_EXTRA not configured to test an alternative ICU library version';
+}
+my $alt_major_version = $1;
+
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init;
+$node1->start;
+
+my $linked_major_version = $node1->safe_psql('postgres', 'select pg_icu_library_version(-1)::decimal::int');
+
+print "linked_major_version = $linked_major_version\n";
+print "alt_major_version = $alt_major_version\n";
+
+if ($alt_major_version >= $linked_major_version)
+{
+	BAIL_OUT("can't run multi-version tests because ICU major version selected via PG_TEST_EXTRA is not lower than the major version the executable is linked against ($linked_major_version)");
+}
+
+# Sanity check that when we load a library, its u_getVersion() function tells
+# us it has the major version we expect.  The result is a string eg "71.1", so
+# we get the major part by casting.
+is($node1->safe_psql('postgres', "select pg_icu_library_version($alt_major_version)::decimal::int"),
+	$alt_major_version,
+	"alt library reports expected major version");
+
+sub set_default_icu_library_version
+{
+	my $major_version = shift;
+	$node1->safe_psql('postgres', "alter system set default_icu_library_version = $major_version; select pg_reload_conf()");
+}
+
+my $ret;
+my $stderr;
+
+# Create a collation that doesn't specify the ICU version to use.  Which
+# library we load depends on the GUC default_icu_library_version.  Here it uses
+# the linked version because it's set to 0 (default value in a new cluster).
+set_default_icu_library_version(0);
+$node1->safe_psql('postgres', "create collation c1 (provider=icu, locale='en')");
+
+# No warning by default.
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c1", stderr => \$stderr);
+is($ret, 0, "can use collation");
+unlike($stderr, qr/WARNING/, "no warning for default");
+
+# No warning if we explicitly select the linked version.
+set_default_icu_library_version($linked_major_version);
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c1", stderr => \$stderr);
+unlike($stderr, qr/WARNING/, "no warning for explicit match");
+
+# If we use a different major version explicitly, we get a warning that
+# includes a hint that we might be able to install and select a different ICU
+# version.
+set_default_icu_library_version($alt_major_version);
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c1", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/WARNING/, "warning for incorrect major version");
+like($stderr, qr/HINT:  Install another version of ICU/, "warning suggests installing another ICU version");
+
+# Create a collation using the alt version without specifying it explicitly.
+# This simulates a collation that was created by a different build linked
+# against an older ICU.
+$node1->safe_psql('postgres', "create collation c2 (provider=icu, locale='en')");
+
+# Warning if we try to use it with default settings.
+set_default_icu_library_version(0);
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c2", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/WARNING/, "warning for incorrect major version");
+like($stderr, qr/HINT:  Install another version of ICU/, "warning suggests installing another ICU version");
+
+# No warning if we explicitly activate the alt version.
+set_default_icu_library_version($alt_major_version);
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c2", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning for explicit match");
+
+# Refresh the version...  this will update it from the linked version (or
+# whatever default_icu_library_version points to, here it's 0 and thus the
+# linked version), because c2 is not explicitly pinned to an ICU major version.
+set_default_icu_library_version(0);
+$ret = $node1->psql('postgres', "alter collation c2 refresh version", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/NOTICE:  changing version/, "version changes");
+
+# Now no warning.
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c2", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "warning has gone away after refresh");
+
+# Create a collation that is pinned to a specific version of ICU.
+$node1->safe_psql('postgres', "create collation c3 (provider=icu, locale='$alt_major_version:en')");
+
+# No warnings expected, no matter what default_icu_library_version says, because
+# we always load that exact library.
+set_default_icu_library_version($linked_major_version);
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c3", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning for explicit lib");
+set_default_icu_library_version($alt_major_version);
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c3", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning for explicit lib");
+set_default_icu_library_version(0);
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c3", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning for explicit lib");
+
+# Similar tests using the database default.
+
+set_default_icu_library_version(0);
+$node1->safe_psql('postgres', "create database db2 locale_provider = icu template = template0 icu_locale = 'en'");
+
+# No warning.
+$ret = $node1->psql('db2', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# Warning when you log into the database.
+set_default_icu_library_version($alt_major_version);
+$ret = $node1->psql('db2', "select", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/WARNING/, "warning for incorrect major version");
+like($stderr, qr/HINT:  Install another version of ICU/, "warning suggests installing another ICU version");
+
+# One way to clear the warning is to REFRESH.
+$ret = $node1->psql('postgres', "alter database db2 refresh collation version", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/NOTICE:  changing version/, "version changes");
+
+# Now the warning is gone.
+$ret = $node1->psql('db2', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# Now we go back to using the linked version, and we'll see the warning again.
+# Perhaps this case simulates the most likely real-world experience, when
+# moving to a new OS that has PostgreSQL packages linked against a later ICU
+# version, using all defaults.
+set_default_icu_library_version(0);
+$ret = $node1->psql('db2', "select", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/WARNING/, "warning for incorrect major version");
+like($stderr, qr/HINT:  Install another version of ICU/, "warning suggests installing another ICU version");
+
+# Option 1 is to get rid of the warning by installing the library and setting
+# the GUC.
+set_default_icu_library_version($alt_major_version);
+$ret = $node1->psql('db2', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning after setting GUC");
+
+# Option 2 is to rebuild indexes etc and use REFRESH.
+set_default_icu_library_version(0);
+$ret = $node1->psql('postgres', "alter database db2 refresh collation version", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/NOTICE:  changing version/, "version changes");
+$ret = $node1->psql('db2', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning after refresh");
+
+# None of this applies if you explicitly pinned your database to a specific
+# ICU major version in the first place, so we ignore the GUC.
+set_default_icu_library_version(0);
+$node1->safe_psql('postgres', "create database db3 locale_provider = icu template = template0 icu_locale = '$alt_major_version:en'");
+
+# No warning with all GUC settings.
+set_default_icu_library_version($alt_major_version);
+$ret = $node1->psql('db3', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning with pinned library version");
+set_default_icu_library_version($linked_major_version);
+$ret = $node1->psql('db3', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning with pinned library version");
+set_default_icu_library_version(0);
+$ret = $node1->psql('db3', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning with pinned library version");
+
+$node1->stop;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f8302f1ed1..12c11f1586 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1101,6 +1101,7 @@ HeapTupleTableSlot
 HistControl
 HotStandbyState
 I32
+ICU_Convert_BI_Func
 ICU_Convert_Func
 ID
 INFIX
@@ -2854,6 +2855,7 @@ TypeName
 U
 U32
 U8
+UBreakIterator
 UChar
 UCharIterator
 UColAttribute
@@ -3484,6 +3486,7 @@ pg_funcptr_t
 pg_gssinfo
 pg_hmac_ctx
 pg_hmac_errno
+pg_icu_library
 pg_int64
 pg_local_to_utf_combined
 pg_locale_t
-- 
2.30.2

#92

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Thomas Munro (#91)

Re: Collation version tracking for macOS

On Sat, Nov 19, 2022 at 7:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Nov 15, 2022 at 1:55 PM Jeff Davis <pgsql@j-davis.com> wrote:

I realize your patch is experimental, but when there is a better
consensus on the approach, we should consider adding declarative syntax
such as:

CREATE COLLATION (or LOCALE?) PROVIDER icu67
TYPE icu VERSION '67' AS '/path/to/icui18n.so.67';

It will offer more opportunities to catch errors early and offer better
error messages. It would also enable it to function if the library is
built with --disable-renaming (though we'd have to trust the user).
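
For readers wondering why --disable-renaming matters here: by default ICU appends a major-version suffix to its exported C symbols (for example ucol_open_67), so a library loaded at runtime has to be probed with the suffixed name. A minimal sketch of building such a name, assuming the `_<major>` suffix form used by current ICU releases. The helper icu_symbol_name is hypothetical and is not taken from the patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the symbol name to look up for an ICU function (hypothetical helper).
 *
 * major_version > 0: library built with default renaming, so the exported
 * name is "<base>_<major>", e.g. "ucol_open_67".
 * major_version <= 0: treat the library as built with --disable-renaming,
 * so the plain name is used and the caller has to be trusted about which
 * version it really is.
 *
 * Returns 0 on success, -1 if the buffer is too small.
 */
int
icu_symbol_name(char *buf, size_t buflen, const char *base, int major_version)
{
	int			n;

	assert(buf != NULL && base != NULL && strlen(base) > 0);

	if (major_version > 0)
		n = snprintf(buf, buflen, "%s_%d", base, major_version);
	else
		n = snprintf(buf, buflen, "%s", base);

	return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

The resulting string would then be passed to dlsym() on the handle returned by dlopen().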

Earlier in this and other threads, we wondered if each ICU major version should
be a separate provider, which is what you're showing there, or should be an
independent property of an individual COLLATION, which is what v6 did with
'63:en' and what Peter suggested I make more formal with CREATE COLLATION foo
(..., ICU_VERSION=63). I actually started out thinking we'd have multiple
providers, but I couldn't really think of any advantage, and I think it makes
some upgrade scenarios more painful. Can you elaborate on why you'd want
that model?
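
To make the '63:en' form concrete, here is a rough sketch of splitting such a string into a pinned major version and the locale proper. This is not the patch's parsing code (which is not shown in this part of the thread); split_versioned_locale is a hypothetical name, and the upper bound of 1000 is borrowed from the range of the default_icu_library_version GUC.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "MAJOR:locale" into its parts (hypothetical helper).
 *
 * On return *major_version holds the pinned ICU major version, or 0 if the
 * string has no valid version prefix.  The return value points at the locale
 * part inside the original string.
 */
const char *
split_versioned_locale(const char *spec, int *major_version)
{
	char	   *end;
	long		v;

	assert(spec != NULL && major_version != NULL);

	*major_version = 0;

	/* A version prefix must start with a digit. */
	if (spec[0] < '0' || spec[0] > '9')
		return spec;

	v = strtol(spec, &end, 10);

	/* Digits not followed by ':' are part of the locale, not a prefix. */
	if (*end != ':' || v <= 0 || v > 1000)
		return spec;

	*major_version = (int) v;
	return end + 1;
}
```

With this, "63:en" yields major version 63 and locale "en", while a plain "en-US" yields 0, meaning "not pinned".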

Hmm, thinking some more about this... I said the above thinking that
you couldn't change a provider after creating a database/collation.
But what if you could?

1. CREATE DATABASE x LOCALE_PROVIDER=icu ...;
2. Some time later after an upgrade, my postgres binary is linked
against a new ICU version and I start seeing warnings.
3. ALTER DATABASE x LOCALE_PROVIDER=icu63;

I suppose you shouldn't be allowed to change libc -> icu, but you
could change icu -> icuXXX, or I guess icuXXX -> icuXXX.

What if you didn't have to manually manage the set of available
providers with DDL like you showed, but we just automatically
supported "icu" (= the linked ICU, whatever it might be), and icu50 up
to icuXXX where XXX is the linked ICU's version? We can encode those
values + libc as an int, to replace the existing char that represents
providers in catalogues.

That's basically just a different way of encoding the same information
that Peter was suggesting I put in a new catalogue attribute. How do
you like that bikeshed colour?
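
One possible shape for that int encoding, purely as a sketch: the numeric values, the names, and the rule that any ICU-to-ICU change is allowed are all assumptions made for illustration (the text above only names icu -> icuXXX and icuXXX -> icuXXX as allowed, and libc -> icu as disallowed).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical encoding: small values are special, larger ones are majors. */
#define PROVIDER_LIBC		0
#define PROVIDER_ICU_LINKED	1	/* "icu": whatever the binary is linked to */
#define PROVIDER_ICU_MIN	50	/* "icu50" is the oldest accepted */

/*
 * Map "libc", "icu" or "icuNN" to the int encoding.  Returns -1 for unknown
 * names and for ICU majors outside [PROVIDER_ICU_MIN, linked_major].
 */
int
provider_from_name(const char *name, int linked_major)
{
	assert(name != NULL);

	if (strcmp(name, "libc") == 0)
		return PROVIDER_LIBC;
	if (strcmp(name, "icu") == 0)
		return PROVIDER_ICU_LINKED;

	if (strncmp(name, "icu", 3) == 0 && name[3] >= '0' && name[3] <= '9')
	{
		char	   *end;
		long		major = strtol(name + 3, &end, 10);

		if (*end == '\0' && major >= PROVIDER_ICU_MIN && major <= linked_major)
			return (int) major;
	}
	return -1;
}

/*
 * Would ALTER ... LOCALE_PROVIDER be allowed?  Assumption: any change that
 * stays within the ICU family is fine, anything involving libc is not.
 */
int
provider_change_allowed(int from, int to)
{
	return from != PROVIDER_LIBC && to != PROVIDER_LIBC;
}
```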

#93

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#91)

Re: Collation version tracking for macOS

On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:

Problem 2: If ICU 67 ever decides to report a different version for a
given collation (would it ever do that? I don't expect so, but ...),
we'd be unable to open the collation with the search-by-collversion
design, and potentially the database. What is a user supposed to do
then? Presumably our error/hint for that would be "please insert the
correct ICU library into drive A", but now there is no correct library

Let's say that Postgres is compiled against version 67.X, and the
sysadmin upgrades the ICU package to 67.Y, which reports a different
collation version for some locale.

Your current patch makes this impossible for the administrator to fix,
because there's no way to have two different libraries loaded with the
same major version number, so it will always pick the compiled-in ICU.
The user will be forced to accept the new version of the collation, see
WARNINGs in their logs, and possibly corrupt their indexes.

Search-by-collversion would still be frustrating for the admin, but at
least it would be possible to fix by compiling their own 67.X and
asking Postgres to search that library, too. We could make it slightly
more friendly by having an error that reports the libraries searched
and the collation versions found, if none of the versions match. We can
have a GUC that controls whether a failure to find the right version is
a WARNING or an ERROR.
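
The search-by-collversion lookup described above, including the WARNING-or-ERROR switch, could look roughly like this. It is a sketch over mock data: in a real implementation each candidate's collversion would come from calling ucol_getVersion() in that library, and all type and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One candidate library and the collation version it reports for a locale. */
typedef struct candidate_lib
{
	const char *lib_version;	/* library version, e.g. "67.1" */
	const char *collversion;	/* what ucol_getVersion() would report */
} candidate_lib;

/* Stand-in for a GUC choosing how a failed search is reported. */
typedef enum
{
	MISMATCH_WARNING,
	MISMATCH_ERROR
} mismatch_policy;

typedef enum
{
	FOUND_EXACT,
	FELL_BACK_WITH_WARNING,
	FAILED_WITH_ERROR
} search_result;

/*
 * Search the candidates for one whose collation version equals "wanted".
 * By convention libs[0] is the linked library, used as the fallback.
 */
search_result
find_lib_by_collversion(const candidate_lib *libs, size_t nlibs,
						const char *wanted, mismatch_policy policy,
						const candidate_lib **chosen)
{
	size_t		i;

	assert(libs != NULL && nlibs > 0 && wanted != NULL && chosen != NULL);

	for (i = 0; i < nlibs; i++)
	{
		if (strcmp(libs[i].collversion, wanted) == 0)
		{
			*chosen = &libs[i];
			return FOUND_EXACT;
		}
	}

	if (policy == MISMATCH_ERROR)
	{
		/* The caller would list the libraries searched in the error. */
		*chosen = NULL;
		return FAILED_WITH_ERROR;
	}

	*chosen = &libs[0];
	return FELL_BACK_WITH_WARNING;
}
```

On failure the caller still has the whole candidate array in hand, so it can print every library searched and the collation version each one reported.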

On Sat, 2022-11-19 at 07:38 +1300, Thomas Munro wrote:

* We'll need some clearer instructions on how to build/install extra
ICU versions that might not be provided by the distribution packaging.
For instance, I got a cryptic error until I used --enable-rpath, which
might not be obvious to all users.

Suggestions welcome. No docs at all yet...

I tried to write up some docs. It's hard to explain why we are exposing
to the user the collation version and the library version in these
different ways, and what effects they have.

The current patch feels like it hasn't decided whether the collation
version is ucol_getVersion() (collversion) or u_getVersion() (library
version). The collversion is more prominent in the UI (with its own
syntax), yet it's just a cross-check for whether to issue a WARNING or
not; while the library version is hidden in the locale field and it
actually decides which symbol is called.

Yeah. I just don't like the way it *appears* to be doing something
clever, but it doesn't solve any fundamental problem at all because the
collversion information is under human control and so it's really doing
something stupid.

I assume by "human control" you mean "ALTER COLLATION ... REFRESH
VERSION". I agree that relying on the admin's declaration is dubious,
especially when we provide no good advice on how to actually do that
safely.

But I don't see what using the library version instead buys us here,
except that library version is part of the LOCALE, and there's no ALTER
command for that. You could just as easily deprecate/eliminate the
ALTER COLLATION REFRESH VERSION, and then say that the collversion is
out of human control, too.

By introducing multiple libraries, I think we need to change that
syntax anyway, to be something like:

ALTER COLLATION ... SET VERSION TO '...'

or even:

ALTER COLLATION ... FORCE VERSION TO '...'

Hence desire to build something that at least admits that it's
primitive and just gives you some controls, in a first version.

Using either the library version or the collation version seems
reasonably simple to me. But from a documentation and usability
standpoint, the way they are currently mixed seems confusing.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#94

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#93)

Re: Collation version tracking for macOS

On Tue, Nov 22, 2022 at 7:34 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:

Problem 2: If ICU 67 ever decides to report a different version for a
given collation (would it ever do that? I don't expect so, but ...),
we'd be unable to open the collation with the search-by-collversion
design, and potentially the database. What is a user supposed to do
then? Presumably our error/hint for that would be "please insert the
correct ICU library into drive A", but now there is no correct library

Let's say that Postgres is compiled against version 67.X, and the
sysadmin upgrades the ICU package to 67.Y, which reports a different
collation version for some locale.

Your current patch makes this impossible for the administrator to fix,
because there's no way to have two different libraries loaded with the
same major version number, so it will always pick the compiled-in ICU.
The user will be forced to accept the new version of the collation, see
WARNINGs in their logs, and possibly corrupt their indexes.

They could probably also 'pin' the older minor version package using
their package manager (= downgrade) until they're ready to upgrade and
use REFRESH VERSION to certify that they've rebuilt everything
relevant or are OK with risks. Not pretty I admit, but I think the
end result is about the same for search-for-collversion, because I
imagine that (1) the default behaviour on failure to search would
likely be to use the linked library instead and WARN about
[dat]collversion mismatch, so far the same, and (2) the set of people
who would really be prepared to compile their own copy of 67.X instead
of downgrading or REFRESHing (with or without rebuilding) is
vanishingly small.

Two questions I wondered about:

1. *Do* they change ucol_getVersion() values in minor releases? I
tried to find a written policy on that.
https://icu.unicode.org/processes is not encouraging: it gives the
example of a "third digit in an official release number" [changing]
because a CLDR change was incorporated. Hrmph. But that's clearly
not even the modern ICU versioning system (it made a change a bit like
ours in 49, making the first number only major, so maybe that "third"
number is now the second number, AKA minor version), and also that's a
CLDR minor version change; is CLDR minor even in the recipe for
ucol_getVersion()? Even without data changes, I guess that bug fixes
could apply to the UCA logic, and I assume that UCA logic is included
in it. Hmm.

A non-hypothetical example of a CLDR change within an ICU major
version that I've been able to find is:

https://cldr.unicode.org/index/downloads/cldr-38

Here we see that CLDR had a minor version bump 38 -> 38.1, "a very
small number of incremental additions to version 38 to address the
specific bugs listed in Δ38.1", and was included in ICU 68.2. Being a
minor ICU release 68.1 -> 68.2, perhaps you could finish up running
that just with a regular upgrade on typical distros (not a major OS
upgrade), and since PostgreSQL would normally be linked against eg
.68, not .68.1, it'd start using it at the next cluster start when
that symlink is updated to point to .68.2. As it happens, if you
follow the documentation links to see what actually changed in that
particular pair of CLDR+ICU minor releases, it's timezones and locale
stuff other than collations, so wouldn't affect us. Can we find a
chapter and verse that says that ICU would only ever move to a new
CLDR in a minor release, and CLDR would never change order of
pre-existing code points in a minor release?

It might be interesting to see if
https://github.com/unicode-org/icu/tree/release-68-1 and
https://github.com/unicode-org/icu/tree/release-68-2 report a
different ucol_getVersion() for any locale, but not conclusive if it
doesn't; it might be because something in the version pipeline knew
that particular CLDR change didn't affect collators...

This speculation feels pretty useless. Maybe we should go and read
the code or ask an ICU expert, but I'm not against making it
theoretically possible to access two different minor versions at once,
just to cover all the bases for future-proofing.

2. Would package managers ever allow two minor versions to be
installed at once? I highly doubt it; they're probably more
interested in ABI stability so that dependent packages work when
bugfixes are shipped, and that's certainly nailed down at the major
version level. It'd probably be a case of having to compile it
yourself, which seems unlikely to me in the real world. That's why I
left minor version out of earlier patches, but I'm OK with changing
that.

As for how, I think that depends on our modelling decision (see below).

Search-by-collversion would still be frustrating for the admin, but at
least it would be possible to fix by compiling their own 67.X and
asking Postgres to search that library, too. We could make it slightly
more friendly by having an error that reports the libraries searched
and the collation versions found, if none of the versions match. We can
have a GUC that controls whether a failure to find the right version is
a WARNING or an ERROR.

Good ideas.

I tried to write up some docs. It's hard to explain why we are exposing
to the user the collation version and the library version in these
different ways, and what effects they have.

Always a good test: see how crazy it sounds when translated to user speak.

The current patch feels like it hasn't decided whether the collation
version is ucol_getVersion() (collversion) or u_getVersion() (library
version). The collversion is more prominent in the UI (with its own
syntax), yet it's just a cross-check for whether to issue a WARNING or
not; while the library version is hidden in the locale field and it
actually decides which symbol is called.

Yeah. I agree that it sucks to have two kinds of versions flying
around in the user's mind.

Yeah. I just don't like the way it *appears* to be doing something
clever, but it doesn't solve any fundamental problem at all because the
collversion information is under human control and so it's really doing
something stupid.

I assume by "human control" you mean "ALTER COLLATION ... REFRESH
VERSION". I agree that relying on the admin's declaration is dubious,
especially when we provide no good advice on how to actually do that
safely.

But I don't see what using the library version instead buys us here,
except that library version is part of the LOCALE, and there's no ALTER
command for that. You could just as easily deprecate/eliminate the
ALTER COLLATION REFRESH VERSION, and then say that the collversion is
out of human control, too.

By introducing multiple libraries, I think we need to change that
syntax anyway, to be something like:

ALTER COLLATION ... SET VERSION TO '...'

or even:

ALTER COLLATION ... FORCE VERSION TO '...'

OK. Time for a new list of the various models we've discussed so far:

1. search-by-collversion: We introduce no new "library version"
concept to COLLATION and DATABASE objects and little or no new syntax.
Whenever opening a collation or database, the system will search some
candidate list of ICU libraries to try to find the one that agrees
with [dat]collversion. When creating a new collation or database, the
system will select one (probably the linked one unless you override
somehow) and record ucol_getVersion() in [dat]collversion. When
searching, it might fail to find a suitable library and ereport; to
fix that, it is the admin's job to somehow expand the set of candidate
libraries. In such a failure case, perhaps it would fall back to
using some default library version (probably the one that is linked,
overridable by GUC?), with a WARNING (unless you turned on ERRORs),
and if you want to shut it up without supplying the right candidate
library, you can still fall back to the REFRESH VERSION hammer (or
maybe that should indeed be called FORCE to make it clearer that it's not
a harmless operation where the system holds your hand, you're actually
certifying that you have rebuilt indexes and you know what you're
doing).

The set of candidate versions could perhaps be provided with
extra_icu_library_versions=63,71 OR =63.1,63.2 strings, at least on
Unix systems following the traditional symlink conventions.
Remembering that a typical Unixoid system should have libraries and
symlinks like:

libicui18n.a
libicui18n.so -> libicui18n.so.71.1
libicui18n.so.63 -> libicui18n.so.63.1
libicui18n.so.63.1
libicui18n.so.67 -> libicui18n.so.67.1
libicui18n.so.67.1
libicui18n.so.71 -> libicui18n.so.71.1
libicui18n.so.71.1

The reason I prefer major[.minor] strings over whole library names is
that we need to dlopen two of them so it's a little easier to build
them from those parts than have to supply both names. The reason I
prefer to keep allowing major-only versions to be listed is that it's
good to have the option to just follow minor upgrades automatically.
Or I guess you could make something that can automatically search a
whole directory (which directory?) to find all the suitably named
libraries so you don't ever have to mention versions manually (if you
want "apt-get install libicu72" to be enough with no GUC change
needed) -- is that too weird?

Perhaps we could write functions that can show you the available
versions to demystify the searching mechanism slightly and show how
various numbers relate, something like (warning: I made up numbers for
illustration, they are wrong!):

SELECT * FROM pg_available_icu_libraries()

 icu_version | unicode_version | uca_version | cldr_version
-------------+-----------------+-------------+--------------
 67.1        | 14.0            | 3.1         | 38.0
 71.1        | 15.0            | 4.0         | 42.0

SELECT * FROM pg_available_icu_collation_versions('en')

 icu_version | collation_version
-------------+-------------------
 67.1        | 142.42
 71.1        | 153.112

2. lib-version-in-providers: We introduce a separate provider value
for each ICU version, for example ICU63, plus an unversioned ICU like
today. The collversion column is used only for warnings. Warnings
are expected when you used the unversioned ICU provider and upgrade to
a binary linked to a later library. You can clear the warnings by
doing ALTER COLLATION/DATABASE SET [LOCALE_]PROVIDER = ICU63, or with
the REFRESH VERSION hammer.

Not sure how you fit minor versions into that, if we want to support
those. Maybe ICU means "whatever is linked", ICU63 means "whatever
libicui18n.so.63 points to" and ICU63_1 means libicui18n.so.63.1,
something like that, so the user can choose from three levels of
specificity.

3. lib-version-in-attributes: We introduce daticuversion (alongside
datcollversion) and collicuversion (alongside collversion). Similar
to the above, but it's a separate property and the provider is always
ICU. New syntax for CREATE/ALTER COLLATION/DATABASE to set and change
ICU_VERSION.

4. lib-version-in-locale: "63:en" from earlier versions. That was
mostly a strawman proposal to avoid getting bogged down in
syntax/catalogue/model change discussions while trying to prove that
dlopen would even work. It doesn't sound like anyone really likes
this.

5. lib-version-in-collversion: We didn't explicitly discuss this
before, but you hinted at it: we could just use u_getVersion() in
[dat]collversion. I haven't analysed this much but I don't think it
has a very nice upgrade path from PG15, and it forces you to decide
whether to store just the major version and not even notice when the
(unstored) minor version changes, or store major.minor and
complain/break down when routine minor upgrades happen. It is a
logical possibility though, once you decide you only want one kind of
version in the system.
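
To make the three levels of specificity from option 2 concrete, here is a
minimal sketch of how the major-only and major.minor library file names
could be derived. The helper name icu_soname is hypothetical, and the
Linux-style libicuNAME.so.MAJOR[.MINOR] naming is an assumption; other
platforms name the files differently.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only: format a Linux-style shared
 * library name for one ICU component ("i18n" or "uc").
 *
 * minor < 0 means "major only" (normally a symlink that follows minor
 * upgrades), e.g. libicui18n.so.63.  Otherwise the exact minor release is
 * named, e.g. libicui18n.so.63.1.
 *
 * Returns what snprintf() returns: the length of the full name, which is
 * >= size if the output was truncated.
 */
int
icu_soname(char *out, size_t size, const char *part, int major, int minor)
{
    assert(out != NULL && part != NULL && major > 0);

    if (minor < 0)
        return snprintf(out, size, "libicu%s.so.%d", part, major);

    return snprintf(out, size, "libicu%s.so.%d.%d", part, major, minor);
}
```

Calling icu_soname(buf, sizeof(buf), "i18n", 63, -1) fills buf with
libicui18n.so.63, while passing minor 1 gives libicui18n.so.63.1. The
unversioned "whatever is linked" level needs no file name at all.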

I'm willing to update the patch to try one of these out so we can kick
the tyres some more, but I'll wait to see if we can get some consensus
on the way forward. Despite my initial reactions, I'm willing to try
out the search-by-collversion concept if others are keen on it. The
example I worked through in the first paragraph of this email helped
me warm to it a little, and with the observability functions I showed
you might have a chance of figuring out what's going on in some edge
cases. Any other ideas, or votes for these ideas?

#95

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#94)

Re: Collation version tracking for macOS

On Wed, 2022-11-23 at 18:08 +1300, Thomas Munro wrote:

(1) the default behaviour on failure to search would
likely be to use the linked library instead and WARN about
[dat]collversion mismatch, so far the same, and

Agreed.

(2) the set of people who would really be prepared to compile their own
copy of 67.X instead of downgrading or REFRESHing (with or without
rebuilding) is vanishingly small.

The set of people prepared to do so is probably small. But the set of
people who will do it (prepared or not) when a problem comes up is
significantly larger ;-)

1. *Do* they change ucol_getVersion() values in minor releases? I
tried to find a written policy on that.

It seems like a valid concern. The mere existence of a collation
version separate from the library major version seems to suggest that
it's possible. Perhaps they avoid it in most cases; but absent a
specific policy against it, the separate collation version seems to
allow them the freedom to do so.

This speculation feels pretty useless. Maybe we should go and read
the code or ask an ICU expert, but I'm not against making it
theoretically possible to access two different minor versions at
once, just to cover all the bases for future-proofing.

I don't think this should be an overriding concern that drives the
whole design. It is a nudge in favor of search-by-collversion.

2. Would package managers ever allow two minor versions to be
installed at once? I highly doubt it;

Agreed.

I'm sure this has been discussed, but which distros even support
multiple major versions of ICU?

1. search-by-collversion: We introduce no new "library version"
concept to COLLATION and DATABASE object and little or no new syntax.
Whenever opening a collation or database, the system will search some
candidate list of ICU libraries to try to find the one that agrees
with [dat]collversion.

[...]

The reason I prefer major[.minor] strings over whole library names is
that we need to dlopen two of them so it's a little easier to build
them from those parts than have to supply both names.

It also makes it easier to know which version suffixes to look for.

The reason I prefer to keep allowing major-only versions to be listed is
that it's good to have the option to just follow minor upgrades
automatically.

Makes sense.

Or I guess you could make something that can automatically search a
whole directory (which directory?) to find all the suitably named
libraries so you don't ever have to mention versions manually (if you
want "apt-get install libicu72" to be enough with no GUC change
needed) -- is that too weird?

That seems to go a little too far.

SELECT * FROM pg_available_icu_libraries()
SELECT * FROM pg_available_icu_collation_versions('en')

2. lib-version-in-providers: We introduce a separate provider value
for each ICU version, for example ICU63, plus an unversioned ICU like
today.

I expressed interest in this approach before, but when you allowed ICU
compiled with --disable-renaming, that mitigated my concerns about when
to throw that error.

3. lib-version-in-attributes: We introduce daticuversion (alongside
datcollversion) and collicuversion (alongside collversion).

I think this is the best among 2-4.

4. lib-version-in-locale: "63:en" from earlier versions. That was
mostly a strawman proposal to avoid getting bogged down in
syntax/catalogue/model change discussions while trying to prove that
dlopen would even work. It doesn't sound like anyone really likes
this.

I don't see any advantage of this over 3.

5. lib-version-in-collversion: We didn't explicitly discuss this
before, but you hinted at it: we could just use u_getVersion() in
[dat]collversion.

The advantage here is that it's very easy to tell the admin what
library the collation is looking for, but the disadvantages you point
out seem a lot worse: migration problems from v15, and the minor
version question.

I'd vote for 1 on the grounds that it's easier to document and
understand a single collation version, which comes straight from
ucol_getVersion(). This approach makes it a separate problem to find
the collation version among whatever libraries the admin can provide;
but adding some observability into the search should mitigate any
confusion.
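
As a rough illustration of search-by-collversion (a sketch, not the
patch's code): given the libraries the admin has made available, pick the
one whose collator version equals the recorded [dat]collversion, and
otherwise fall back to the default library with a warning. The struct and
function names are hypothetical, and the version strings in the usage note
are the made-up values from the example output earlier in the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a loaded ICU library; not the patch's struct. */
typedef struct CandidateLib
{
    const char *icu_version;        /* e.g. "67.1" */
    const char *collator_version;   /* what ucol_getVersion() would report */
} CandidateLib;

/*
 * Return the first candidate whose collator version equals the recorded
 * collversion.  If none matches, print a warning and return default_lib
 * (standing in for the linked library).
 */
const CandidateLib *
find_lib_for_collversion(const CandidateLib *libs, size_t nlibs,
                         const CandidateLib *default_lib,
                         const char *collversion)
{
    assert(default_lib != NULL && collversion != NULL);

    for (size_t i = 0; i < nlibs; i++)
    {
        if (strcmp(libs[i].collator_version, collversion) == 0)
            return &libs[i];
    }

    fprintf(stderr,
            "WARNING: no ICU library provides collation version %s, using %s\n",
            collversion, default_lib->icu_version);
    return default_lib;
}
```

With candidates {"67.1", "142.42"} and {"71.1", "153.112"} and the second
one as the default, a recorded version of 142.42 selects the 67.1 library,
while an unknown version falls back to 71.1 and warns. In reality the
collator version also depends on the locale, so the comparison would be
made per collation rather than once per library.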

Can you go over the advantages of approaches 2-4 again? Is it just a
concern about burdening the admin with finding the right ICU library
version for a given collation version? That's a valid concern, but I
don't think that should be an overriding design point. It seems more
important to model the collation versions properly.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#96

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#95)

Re: Collation version tracking for macOS

On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote:

I'm sure this has been discussed, but which distros even support
multiple major versions of ICU?

For Debian and friends, you can install any number of libicuNN
packages (if you can find them eg from previous release repos), but
there's only one libicu-dev. That means that one specific major
version is blessed by each Debian release and has its headers and
static libraries for you to use as a developer, but you can still
install the dynamic libraries from older releases at the same time to
satisfy the dependencies of packages or programs that were built on an
earlier OS release. They don't declare conflicts on each other and
they contain non-conflicting filenames. That's similar to the way
standard libraries and various other things are treated, for backward
compatibility.

For RHEL and friends, I'm pretty sure it's the same concept, but I
don't use those and haven't seen it with my own eyes.

I don't know about other Linux distros/families, but I expect the above
two cover a huge percentage of our users and I expect others to have
made similar choices.

For the BSDs, which tend to have a single binary package with both
headers and libraries owing to their origins as source-based
distributions (ports), the above way of thinking doesn't work; I
couldn't develop this on my usual FreeBSD battlestation without
building ICU myself (problem being that there's only one "pkg install
icu") and I hope to talk to someone who knows what to do about that
eventually. I want this to work there easily for end users.

macOS and Windows have so many different ways of installing things
that there isn't a single answer there; supposedly open source is like
a bazaar and closed source like a cathedral, but as far as package
management goes, it looks more like rubble to me.

#97

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#95)

Re: Collation version tracking for macOS

On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote:

I'd vote for 1 on the grounds that it's easier to document and
understand a single collation version, which comes straight from
ucol_getVersion(). This approach makes it a separate problem to find
the collation version among whatever libraries the admin can provide;
but adding some observability into the search should mitigate any
confusion.

OK, it sounds like I should code that up next.

Can you go over the advantages of approaches 2-4 again? Is it just a
concern about burdening the admin with finding the right ICU library
version for a given collation version? That's a valid concern, but I
don't think that should be an overriding design point. It seems more
important to model the collation versions properly.

Yes, that's a good summary. The user has a problem, and the solution
is to find some version of ICU and install it, so the problem space
necessarily involves the other kind of version. My idea was that we
should therefore make that part of the model. But the observability
support does indeed make it a bit clearer what's going on.

#98

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Thomas Munro (#97)

2 attachment(s)

Re: Collation version tracking for macOS

On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote:

I'd vote for 1 on the grounds that it's easier to document and
understand a single collation version, which comes straight from
ucol_getVersion(). This approach makes it a separate problem to find
the collation version among whatever libraries the admin can provide;
but adding some observability into the search should mitigate any
confusion.

OK, it sounds like I should code that up next.

Here's the first iteration. The version rosetta stone functions look like this:

postgres=# select * from pg_icu_library_versions();
 icu_version | unicode_version | cldr_version
-------------+-----------------+--------------
 67.1        | 13.0            | 37.0
 63.1        | 11.0            | 34.0
 57.1        | 8.0             | 29.0
(3 rows)

postgres=# select * from pg_icu_collation_versions('zh');
 icu_version | uca_version | collator_version
-------------+-------------+------------------
 67.1        | 13.0        | 153.14.37
 63.1        | 11.0        | 153.88.34
 57.1        | 8.0         | 153.64.29
(3 rows)

It's no longer necessary to put anything in PG_TEST_EXTRA to run
"meson test icu/020_multiversion" usefully. It will find extra ICU
versions all by itself in your system library search path and SKIP if
it doesn't find a second major version. I have tried to cover the
main scenarios that I expect users to encounter in the updated TAP
tests, with commentary that I hope will be helpful to assess the
usability of this thing.

Other changes:

* now using RTLD_LOCAL instead of RTLD_GLOBAL (I guess the latter
might cause trouble for someone using --disable-renaming, but I
haven't tested that and am not an expert on linker/loader arcana)
* fixed library names on Windows (based on reading the manual, but I
haven't tested that)
* fixed failure on non-ICU builds (the reason CI was failing in v7,
some misplaced #ifdefs)
* various cleanup
* I've attached a throwaway patch to install a second ICU version on
Debian/amd64 on CI, since otherwise the new test would SKIP on all
systems
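
For readers wondering how --disable-renaming interacts with symbol lookup,
here is a self-contained sketch of the "versioned name first, plain name
second" idea. It is not the patch's function: the lookup is passed in as a
callback with the same shape as dlsym(), and FakeLibrary and fake_dlsym are
toy stand-ins invented here so the example runs without any ICU library
installed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same shape as dlsym(): look up a symbol name in an opened library. */
typedef void *(*symbol_lookup_fn) (void *handle, const char *name);

/* Toy stand-in for a loaded library: just a list of exported names. */
typedef struct FakeLibrary
{
    const char **symbols;
    size_t      nsymbols;
} FakeLibrary;

/* Toy lookup: returns the matching name string itself, or NULL. */
void *
fake_dlsym(void *handle, const char *name)
{
    FakeLibrary *lib = handle;

    for (size_t i = 0; i < lib->nsymbols; i++)
    {
        if (strcmp(lib->symbols[i], name) == 0)
            return (void *) lib->symbols[i];
    }
    return NULL;
}

/*
 * Try "name_MAJOR" first (ICU's default symbol renaming), then plain
 * "name" (a library configured with --disable-renaming).
 */
void *
lookup_icu_symbol(symbol_lookup_fn lookup, void *handle,
                  const char *name, int major)
{
    char        versioned[80];
    void       *result;
    int         n;

    assert(lookup != NULL && name != NULL);

    n = snprintf(versioned, sizeof(versioned), "%s_%d", name, major);
    if (n < 0 || (size_t) n >= sizeof(versioned))
        return NULL;            /* name too long for this sketch */

    result = lookup(handle, versioned);
    return result ? result : lookup(handle, name);
}
```

A library exporting ucol_open_67 and one exporting plain ucol_open both
resolve a request for "ucol_open" with major version 67. With the real
dlsym() as the callback, the handle would come from dlopen().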

This is just a first cut, but enough to try out and see if we like it,
what needs to be improved, what edge cases we haven't thought about
etc. Let me know what you think.
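
The icu_library_versions setting described in the attached patch accepts
entries of the form *, MAJOR or MAJOR.MINOR. A minimal sketch of how one
such entry could be classified follows; the enum and function names are
hypothetical, and the sscanf-based approach is only assumed to resemble
what the patch does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum VersionKind
{
    VERSION_INVALID,
    VERSION_ANY,                /* "*" */
    VERSION_MAJOR,              /* "67" */
    VERSION_MAJOR_MINOR         /* "67.1" */
} VersionKind;

/*
 * Classify one list entry.  Leading spaces are skipped.  *major and
 * *minor are set to -1 when the corresponding part is absent.
 */
VersionKind
parse_icu_version_token(const char *token, int *major, int *minor)
{
    assert(token != NULL && major != NULL && minor != NULL);

    *major = -1;
    *minor = -1;

    while (*token == ' ')
        token++;

    if (strcmp(token, "*") == 0)
        return VERSION_ANY;

    /* Try the more specific form first: "67" alone yields only 1 match. */
    if (sscanf(token, "%d.%d", major, minor) == 2)
        return VERSION_MAJOR_MINOR;

    *minor = -1;
    if (sscanf(token, "%d", major) == 1)
        return VERSION_MAJOR;

    *major = -1;
    return VERSION_INVALID;
}
```

Note that sscanf ignores trailing garbage, so an entry such as "67.x" is
treated as major version 67 here; a stricter parser would use strtol and
check the end pointer.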

Attachments:

v8-0001-WIP-Multi-version-ICU.patch (text/x-patch; charset=US-ASCII)

From 0d96bfbec02245ddce6c985250ff0f8d38e41df9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v8 1/2] WIP: Multi-version ICU.

Add a layer of indirection when accessing ICU, so that multiple major
versions of the library can be used at once.  Versions other than the
one that PostgreSQL was linked against are opened with dlopen(), but we
refuse to open versions higher than the one we were compiled against.
The ABI might change in future releases so that wouldn't be safe.

Whenever creating a DATABASE or COLLATION object that uses ICU, we'll
use the "default" ICU library and record its ucol_getVersion() in the
catalog.  That's usually the one we're linked against but another can be
selected with the setting default_icu_library_version.

Whenever opening an existing DATABASE or COLLATION object that uses ICU,
we'll see the recorded [dat]collversion and try to find the ICU library
that provides that version.  If we can't, we'll fall back to using the
default ICU library with a warning that the user should either install
another ICU library version, or rebuild affected database objects and
REFRESH.

New GUCs:

icu_library_path

  A place to find ICU libraries, if not the default system library
  search path.

icu_library_versions

  A comma-separated list of ICU major or major.minor versions to make
  available to PostgreSQL, or * for every major version that can be
  found (the default).

default_icu_library_version

  The major or major.minor version to use for new objects and as a
  fallback (with warnings) if the right version can't be found.

Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CA%2BhUKGL4VZRpP3CkjYQkv4RQ6pRYkPkSNgKSxFBwciECQ0mEuQ%40mail.gmail.com
---
 src/backend/access/hash/hashfunc.c            |  16 +-
 src/backend/commands/collationcmds.c          |  20 +
 src/backend/utils/adt/formatting.c            |  53 +-
 src/backend/utils/adt/pg_locale.c             | 748 +++++++++++++++++-
 src/backend/utils/adt/varchar.c               |  16 +-
 src/backend/utils/adt/varlena.c               |  56 +-
 src/backend/utils/init/postinit.c             |  34 +-
 src/backend/utils/misc/guc_tables.c           |  40 +-
 src/backend/utils/misc/postgresql.conf.sample |  10 +
 src/include/catalog/pg_proc.dat               |  23 +
 src/include/utils/pg_locale.h                 |  85 +-
 src/test/icu/meson.build                      |   1 +
 src/test/icu/t/020_multiversion.pl            | 274 +++++++
 src/tools/pgindent/typedefs.list              |   4 +
 14 files changed, 1275 insertions(+), 105 deletions(-)
 create mode 100644 src/test/icu/t/020_multiversion.pl

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..0a61538efd 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,11 @@ hashtext(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -355,11 +355,11 @@ hashtextextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 81e54e0ce6..4fb0c77f38 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -853,6 +853,26 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 					CreateComments(collid, CollationRelationId, 0,
 								   icucomment);
 			}
+
+			/* Also create an object pinned to an ICU major version. */
+			collid = CollationCreate(psprintf("%s-x-icu-%d", langtag, U_ICU_VERSION_MAJOR_NUM),
+									 nspid, GetUserId(),
+									 COLLPROVIDER_ICU, true, -1,
+									 NULL, NULL,
+									 psprintf("%d:%s", U_ICU_VERSION_MAJOR_NUM, iculocstr),
+									 get_collation_actual_version(COLLPROVIDER_ICU, iculocstr),
+									 true, true);
+			if (OidIsValid(collid))
+			{
+				ncreated++;
+
+				CommandCounterIncrement();
+
+				icucomment = get_icu_locale_comment(name);
+				if (icucomment)
+					CreateComments(collid, CollationRelationId, 0,
+								   icucomment);
+			}
 		}
 	}
 #endif							/* USE_ICU */
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 26f498b5df..0c3c7724d7 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1599,6 +1599,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1623,18 +1628,41 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						PG_ICU_LIB(mylocale)->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->info.icu.locale, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->info.icu.locale, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						PG_ICU_LIB(mylocale)->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1702,7 +1730,8 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToLower,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1824,7 +1853,8 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_conv = icu_convert_case(PG_ICU_LIB(mylocale)->strToUpper,
+										mylocale,
 										&buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
@@ -1947,8 +1977,9 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 			UChar	   *buff_conv;
 
 			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
+			len_conv = icu_convert_case_bi(PG_ICU_LIB(mylocale)->strToTitle,
+										   mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
 			icu_from_uchar(&result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..004100af66 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -57,7 +57,9 @@
 #include "access/htup_details.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_control.h"
+#include "funcapi.h"
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/formatting.h"
 #include "utils/guc_hooks.h"
@@ -69,6 +71,8 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -79,14 +83,36 @@
 #include <shlwapi.h>
 #endif
 
+#include <dlfcn.h>
+
 #define		MAX_L10N_DATA		80
 
+#ifdef USE_ICU
+
+/*
+ * We don't want to call into dlopen'd ICU libraries that are newer than the
+ * one we were compiled and linked against, just in case there is an
+ * incompatible API change.
+ */
+#define PG_MAX_ICU_MAJOR_VERSION U_ICU_VERSION_MAJOR_NUM
+
+/*
+ * The oldest ICU release we're likely to encounter, and that has all the
+ * functions required.
+ */
+#define PG_MIN_ICU_MAJOR_VERSION 50
+
+#endif
+
 
 /* GUC settings */
 char	   *locale_messages;
 char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
+char	   *icu_library_path;
+char	   *icu_library_versions;
+char	   *default_icu_library_version;
 
 /*
  * lc_time localization cache.
@@ -123,7 +149,9 @@ static char *IsoLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
-static void icu_set_collation_attributes(UCollator *collator, const char *loc);
+static void icu_set_collation_attributes(pg_icu_library *lib,
+										 UCollator *collator,
+										 const char *loc);
 #endif
 
 /*
@@ -1400,33 +1428,544 @@ lc_ctype_is_c(Oid collation)
 
 struct pg_locale_struct default_locale;
 
-void
+#ifdef USE_ICU
+
+static pg_icu_library *icu_library_list;
+static pg_icu_library *default_icu_library;
+static bool icu_library_list_fully_loaded;
+
+static void *
+get_icu_function(void *handle, const char *function, int version)
+{
+	char		function_with_version[80];
+	void	   *result;
+
+	/*
+	 * Try to look it up using the symbols with major versions, but if that
+	 * doesn't work, also try the unversioned name in case the library was
+	 * configured with --disable-renaming.
+	 */
+	snprintf(function_with_version, sizeof(function_with_version), "%s_%d",
+			 function, version);
+	result = dlsym(handle, function_with_version);
+
+	return result ? result : dlsym(handle, function);
+}
+
+static void
+make_icu_library_name(char *output,
+					  const char *name,
+					  int major_version,
+					  int minor_version)
+{
+	/*
+	 * See
+	 * https://unicode-org.github.io/icu/userguide/icu4c/packaging.html#icu-versions
+	 * for conventions on library naming on POSIX and Windows systems.  Apple
+	 * isn't mentioned but varies in the usual way.
+	 *
+	 * Format 1 is expected to be a major version-only symlink pointing to a
+	 * specific minor version (or on Windows it may be the actual library).
+	 * Format 2 is expected to be an actual library.
+	 */
+#ifdef WIN32
+#define ICU_LIBRARY_NAME_FORMAT1 "%s%sicu%s%d" DLSUFFIX
+#define ICU_LIBRARY_NAME_FORMAT2 "%s%sicu%s%d.%d" DLSUFFIX
+#elif defined(__darwin__)
+#define ICU_LIBRARY_NAME_FORMAT1 "%s%slibicu%s.%d" DLSUFFIX
+#define ICU_LIBRARY_NAME_FORMAT2 "%s%slibicu%s.%d.%d" DLSUFFIX
+#else
+#define ICU_LIBRARY_NAME_FORMAT1 "%s%slibicu%s" DLSUFFIX ".%d"
+#define ICU_LIBRARY_NAME_FORMAT2 "%s%slibicu%s" DLSUFFIX ".%d.%d"
+#endif
+
+#ifdef WIN32
+#define PATH_SEPARATOR "\\"
+#define ICU_I18N "in"
+#define ICU_UC "uc"
+#else
+#define PATH_SEPARATOR "/"
+#define ICU_I18N "i18n"
+#define ICU_UC "uc"
+#endif
+
+	if (minor_version < 0)
+		snprintf(output,
+				 MAXPGPATH,
+				 ICU_LIBRARY_NAME_FORMAT1,
+				 icu_library_path,
+				 icu_library_path[0] ? PATH_SEPARATOR : "",
+				 name,
+				 major_version);
+	else
+		snprintf(output,
+				 MAXPGPATH,
+				 ICU_LIBRARY_NAME_FORMAT2,
+				 icu_library_path,
+				 icu_library_path[0] ? PATH_SEPARATOR : "",
+				 name,
+				 major_version,
+				 minor_version);
+}
+
+/*
+ * Given an ICU library major version and optionally minor version (or -1 for
+ * any), return the object we need to access all the symbols in the pair of
+ * libraries we need.  Returns NULL if the library can't be found.  Returns
+ * NULL and logs a warning if the library can be found but cannot be used for
+ * some reason.
+ */
+static pg_icu_library *
+load_icu_library(int major_version, int minor_version)
+{
+	UVersionInfo version_info;
+	pg_icu_library *lib;
+	void	   *libicui18n_handle = NULL;
+	void	   *libicuuc_handle = NULL;
+
+	/*
+	 * We don't dare open libraries outside the range that we know has an API
+	 * compatible with the headers we are compiling against.
+	 */
+	if (major_version < PG_MIN_ICU_MAJOR_VERSION ||
+		major_version > PG_MAX_ICU_MAJOR_VERSION)
+	{
+		elog(WARNING,
+			 "ICU version must be between %d and %d",
+			 PG_MIN_ICU_MAJOR_VERSION,
+			 PG_MAX_ICU_MAJOR_VERSION);
+		return NULL;
+	}
+
+	/*
+	 * We were compiled against a certain version of ICU, though the minor
+	 * version might have changed if the library was upgraded.  Does it
+	 * satisfy the request?
+	 */
+	u_getVersion(version_info);
+	if (version_info[0] == major_version &&
+		(minor_version == -1 || version_info[1] == minor_version))
+	{
+		/*
+		 * These assignments will fail to compile if an incompatible API
+		 * change is made to some future version of ICU, at which point we
+		 * might need to consider special treatment for different major
+		 * version ranges, with intermediate trampoline functions.
+		 */
+		lib = MemoryContextAllocZero(TopMemoryContext, sizeof(*lib));
+		lib->getICUVersion = u_getVersion;
+		lib->getUnicodeVersion = u_getUnicodeVersion;
+		lib->getCLDRVersion = ulocdata_getCLDRVersion;
+		lib->open = ucol_open;
+		lib->close = ucol_close;
+		lib->getCollatorVersion = ucol_getVersion;
+		lib->getUCAVersion = ucol_getUCAVersion;
+		lib->versionToString = u_versionToString;
+		lib->strcoll = ucol_strcoll;
+		lib->strcollUTF8 = ucol_strcollUTF8;
+		lib->getSortKey = ucol_getSortKey;
+		lib->nextSortKeyPart = ucol_nextSortKeyPart;
+		lib->setUTF8 = uiter_setUTF8;
+		lib->errorName = u_errorName;
+		lib->strToUpper = u_strToUpper;
+		lib->strToLower = u_strToLower;
+		lib->strToTitle = u_strToTitle;
+		lib->setAttribute = ucol_setAttribute;
+
+		/*
+		 * Also assert the size of a couple of types used as output buffers,
+		 * as a canary to tell us to add extra padding in the (unlikely) event
+		 * that a later release makes these values smaller.
+		 */
+		StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+						 "u_versionToString output buffer size changed incompatibly");
+		StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+						 "ucol_getVersion output buffer size changed incompatibly");
+	}
+	else
+	{
+		/* This is an older version, so we'll need to use dlopen(). */
+		char		libicui18n_name[MAXPGPATH];
+		char		libicuuc_name[MAXPGPATH];
+
+		/* Load the internationalization library. */
+		make_icu_library_name(libicui18n_name, ICU_I18N, major_version, minor_version);
+		libicui18n_handle = dlopen(libicui18n_name, RTLD_NOW | RTLD_LOCAL);
+		if (!libicui18n_handle)
+			return NULL;
+
+		/* Load the common library. */
+		make_icu_library_name(libicuuc_name, ICU_UC, major_version, minor_version);
+		libicuuc_handle = dlopen(libicuuc_name, RTLD_NOW | RTLD_LOCAL);
+		if (!libicuuc_handle)
+		{
+			elog(WARNING, "found library \"%s\" but not companion library \"%s\"",
+				 libicui18n_name, libicuuc_name);
+			dlclose(libicui18n_handle);
+			return NULL;
+		}
+
+		/*
+		 * We only allocate the pg_icu_library object after successfully
+		 * opening the libraries to minimize the work done in the ENOENT case,
+		 * when probing a range of versions.  That means we might need to
+		 * clean up on allocation failure.
+		 */
+		lib = MemoryContextAllocExtended(TopMemoryContext, sizeof(*lib),
+										 MCXT_ALLOC_NO_OOM);
+		if (!lib)
+		{
+			dlclose(libicui18n_handle);
+			dlclose(libicuuc_handle);
+			elog(ERROR, "out of memory");
+		}
+
+		/* Now try to find all the symbols we need. */
+		lib->getICUVersion = get_icu_function(libicui18n_handle,
+											  "u_getVersion",
+											  major_version);
+		lib->getUnicodeVersion = get_icu_function(libicui18n_handle,
+												  "u_getUnicodeVersion",
+												  major_version);
+		lib->getCLDRVersion = get_icu_function(libicui18n_handle,
+											   "ulocdata_getCLDRVersion",
+											   major_version);
+		lib->open = get_icu_function(libicui18n_handle,
+									 "ucol_open",
+									 major_version);
+		lib->close = get_icu_function(libicui18n_handle,
+									  "ucol_close",
+									  major_version);
+		lib->getCollatorVersion = get_icu_function(libicui18n_handle,
+												   "ucol_getVersion",
+												   major_version);
+		lib->getUCAVersion = get_icu_function(libicui18n_handle,
+											  "ucol_getUCAVersion",
+											  major_version);
+		lib->versionToString = get_icu_function(libicui18n_handle,
+												"u_versionToString",
+												major_version);
+		lib->strcoll = get_icu_function(libicui18n_handle,
+										"ucol_strcoll",
+										major_version);
+		lib->strcollUTF8 = get_icu_function(libicui18n_handle,
+											"ucol_strcollUTF8",
+											major_version);
+		lib->getSortKey = get_icu_function(libicui18n_handle,
+										   "ucol_getSortKey",
+										   major_version);
+		lib->nextSortKeyPart = get_icu_function(libicui18n_handle,
+												"ucol_nextSortKeyPart",
+												major_version);
+		lib->setUTF8 = get_icu_function(libicui18n_handle,
+										"uiter_setUTF8",
+										major_version);
+		lib->errorName = get_icu_function(libicui18n_handle,
+										  "u_errorName",
+										  major_version);
+		lib->strToUpper = get_icu_function(libicuuc_handle,
+										   "u_strToUpper",
+										   major_version);
+		lib->strToLower = get_icu_function(libicuuc_handle,
+										   "u_strToLower",
+										   major_version);
+		lib->strToTitle = get_icu_function(libicuuc_handle,
+										   "u_strToTitle",
+										   major_version);
+		lib->setAttribute = get_icu_function(libicui18n_handle,
+											 "ucol_setAttribute",
+											 major_version);
+
+		/* Did we find everything? */
+		if (!lib->getICUVersion ||
+			!lib->getUnicodeVersion ||
+			!lib->getCLDRVersion ||
+			!lib->open ||
+			!lib->close ||
+			!lib->getCollatorVersion ||
+			!lib->getUCAVersion ||
+			!lib->versionToString ||
+			!lib->strcoll ||
+			!lib->strcollUTF8 ||
+			!lib->getSortKey ||
+			!lib->nextSortKeyPart ||
+			!lib->setUTF8 ||
+			!lib->errorName ||
+			!lib->strToUpper ||
+			!lib->strToLower ||
+			!lib->strToTitle ||
+			!lib->setAttribute)
+		{
+			dlclose(libicui18n_handle);
+			dlclose(libicuuc_handle);
+			pfree(lib);
+			ereport(WARNING,
+					(errmsg("could not find all expected symbols in libraries \"%s\" and \"%s\"",
+							libicui18n_name, libicuuc_name)));
+			return NULL;
+		}
+	}
+
+	/* Is this major.minor already loaded? */
+	lib->getICUVersion(version_info);
+	lib->major_version = version_info[0];
+	lib->minor_version = version_info[1];
+	for (pg_icu_library *lib2 = icu_library_list; lib2; lib2 = lib2->next)
+	{
+		if (lib2->major_version == lib->major_version &&
+			lib2->minor_version == lib->minor_version)
+		{
+			if (libicui18n_handle)
+				dlclose(libicui18n_handle);
+			if (libicuuc_handle)
+				dlclose(libicuuc_handle);
+			pfree(lib);
+
+			/* Return the one we already had. */
+			return lib2;
+		}
+	}
+
+	/* Add to list of loaded libraries. */
+	lib->next = icu_library_list;
+	icu_library_list = lib;
+
+	return lib;
+}
+
+static pg_icu_library *
+get_icu_library_list(void)
+{
+	char	   *copy;
+	char	   *token;
+	char	   *saveptr;
+
+	if (icu_library_list_fully_loaded)
+		return icu_library_list;
+
+	copy = pstrdup(icu_library_versions);
+	token = strtok_r(copy, ",", &saveptr);
+	while (token)
+	{
+		int			major_version;
+		int			minor_version;
+
+		/* Ignore spaces between commas. */
+		while (*token == ' ')
+			++token;
+
+		if (strcmp(token, "*") == 0)
+		{
+			/* Try to load every supportable major library version. */
+			for (int i = PG_MIN_ICU_MAJOR_VERSION; i <= PG_MAX_ICU_MAJOR_VERSION; ++i)
+				load_icu_library(i, -1);
+		}
+		else if (sscanf(token, "%d.%d", &major_version, &minor_version) == 2)
+		{
+			/* Try to load a version with an explicit minor version provided. */
+			if (!load_icu_library(major_version, minor_version))
+				ereport(WARNING,
+						(errmsg("could not open ICU library \"%s\"", token)));
+		}
+		else if (sscanf(token, "%d", &major_version) == 1)
+		{
+			/* Try to load a major version through symlinks. */
+			if (!load_icu_library(major_version, -1))
+				ereport(WARNING,
+						(errmsg("could not open ICU library \"%s\"", token)));
+		}
+		else
+			ereport(WARNING,
+					(errmsg("could not parse ICU library version \"%s\"", token)));
+
+		token = strtok_r(NULL, ",", &saveptr);
+	}
+	pfree(copy);
+
+	icu_library_list_fully_loaded = true;
+
+	return icu_library_list;
+}
+
+static pg_icu_library *
+get_default_icu_library(void)
+{
+	int			major_version;
+	int			minor_version;
+
+	if (default_icu_library)
+		return default_icu_library;
+
+	if (default_icu_library_version[0] == 0)
+	{
+		/* Use the linked version by default. */
+		default_icu_library = load_icu_library(PG_MAX_ICU_MAJOR_VERSION, -1);
+		Assert(default_icu_library);
+	}
+	else if (sscanf(default_icu_library_version, "%d.%d", &major_version, &minor_version) == 2)
+	{
+		/* Try to load a version with an explicit major.minor version. */
+		default_icu_library = load_icu_library(major_version, minor_version);
+	}
+	else if (sscanf(default_icu_library_version, "%d", &major_version) == 1)
+	{
+		/* Try to load a version using only major (usually a symlink on Unix). */
+		default_icu_library = load_icu_library(major_version, -1);
+	}
+	else
+	{
+		ereport(WARNING,
+				(errmsg("could not parse default_icu_library_version \"%s\"",
+						default_icu_library_version)));
+	}
+
+	if (!default_icu_library)
+	{
+		/*
+		 * Fall back to the linked version with a warning if the above
+		 * attempts failed.
+		 */
+		default_icu_library = load_icu_library(PG_MAX_ICU_MAJOR_VERSION, -1);
+		Assert(default_icu_library);
+		ereport(WARNING,
+				(errmsg("could not load ICU library version \"%s\", so using linked version %d.%d instead",
+						default_icu_library_version,
+						default_icu_library->major_version,
+						default_icu_library->minor_version)));
+		Assert(default_icu_library);
+	}
+
+	return default_icu_library;
+}
+
+/*
+ * Try to open a collator with a specific version from a given library.
+ * Returns NULL on failure.
+ */
+static UCollator *
+get_icu_collator(pg_icu_library *lib,
+				 const char *locale,
+				 const char *collversion)
+{
+	UErrorCode	status;
+	UCollator  *collator;
+	UVersionInfo version_info;
+	char		version_info_string[U_MAX_VERSION_STRING_LENGTH];
+
+	/* Can we even open it? */
+	status = U_ZERO_ERROR;
+	collator = lib->open(locale, &status);
+	if (!collator)
+		return NULL;
+
+	/*
+	 * Does it have the requested version?  We tolerate a null collversion
+	 * argument only for bootstrapping in initdb --locale-provider=icu, where we
+	 * accept the first library we try.
+	 */
+	if (collversion)
+	{
+		lib->getCollatorVersion(collator, version_info);
+		lib->versionToString(version_info, version_info_string);
+		if (strcmp(version_info_string, collversion) != 0)
+		{
+			lib->close(collator);
+			return NULL;
+		}
+	} else
+		Assert(!IsUnderPostmaster);
+
+	/* XXX this can raise an error and leak collator! */
+	if (lib->major_version < 54)
+		icu_set_collation_attributes(lib, collator, locale);
+
+	return collator;
+}
+
+#endif
+
+/*
+ * Returns true if a collator with ucol_getVersion() matching collversion could
+ * not be found in any available ICU library, so the default library was used
+ * instead.
+ */
+bool
 make_icu_collator(const char *iculocstr,
+				  const char *collversion,
 				  struct pg_locale_struct *resultp)
 {
+	bool		using_default = false;
 #ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
+	pg_icu_library *lib = NULL;
+	UCollator  *collator = NULL;
 
-	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+	/*
+	 * Try the default library first, which might avoid the need to dlopen()
+	 * libraries in the common case that it's the version we're linked
+	 * against.
+	 */
+	lib = get_default_icu_library();
+	collator = get_icu_collator(lib, iculocstr, collversion);
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
+	/*
+	 * If that didn't succeed, try every available library.
+	 */
+	if (!collator)
+	{
+		for (lib = get_icu_library_list(); lib; lib = lib->next)
+		{
+			collator = get_icu_collator(lib, iculocstr, collversion);
+			if (collator)
+				break;
+		}
+	}
+
+	/*
+	 * If we didn't find a match, it's time to fall back to our default
+	 * library.  We'll also return true so the caller can generate a more
+	 * specific warning about what to do.
+	 */
+	if (!collator)
+	{
+		UVersionInfo version_info;
+		char		version_info_string[U_MAX_VERSION_STRING_LENGTH];
+		UErrorCode	status;
+
+		lib = get_default_icu_library();
+
+		status = U_ZERO_ERROR;
+		collator = lib->open(iculocstr, &status);
+		if (!collator)
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							iculocstr, lib->errorName(status))));
+
+		lib->getCollatorVersion(collator, version_info);
+		lib->versionToString(version_info, version_info_string);
+		ereport(WARNING,
+				(errmsg("could not find ICU collator for locale \"%s\" with "
+						"version %s, so using version %s from default "
+						"ICU library %d.%d instead",
+						iculocstr, collversion, version_info_string,
+						lib->major_version, lib->minor_version)));
+
+		using_default = true;
+	}
+
+	Assert(lib);
 
 	/* We will leak this string if the caller errors later :-( */
 	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
 	resultp->info.icu.ucol = collator;
+	resultp->info.icu.lib = lib;
 #else							/* not USE_ICU */
 	/* could get here if a collation was created by a build with ICU */
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 			 errmsg("ICU is not supported in this build")));
 #endif							/* not USE_ICU */
+
+	return using_default;
 }
 
 
@@ -1504,6 +2043,7 @@ pg_newlocale_from_collation(Oid collid)
 		pg_locale_t resultp;
 		Datum		datum;
 		bool		isnull;
+		const char *collversion;
 
 		tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collid));
 		if (!HeapTupleIsValid(tp))
@@ -1515,6 +2055,10 @@ pg_newlocale_from_collation(Oid collid)
 		result.provider = collform->collprovider;
 		result.deterministic = collform->collisdeterministic;
 
+		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
+								&isnull);
+		collversion = isnull ? NULL : TextDatumGetCString(datum);
+
 		if (collform->collprovider == COLLPROVIDER_LIBC)
 		{
 #ifdef HAVE_LOCALE_T
@@ -1584,23 +2128,37 @@ pg_newlocale_from_collation(Oid collid)
 			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale, &isnull);
 			Assert(!isnull);
 			iculocstr = TextDatumGetCString(datum);
-			make_icu_collator(iculocstr, &result);
+			if (!collversion)
+				elog(ERROR, "ICU collation lacks version");
+			if (make_icu_collator(iculocstr, collversion, &result))
+			{
+				ereport(WARNING,
+						errmsg("collation \"%s\" has version mismatch",
+							   NameStr(collform->collname)),
+						errdetail("The collation in the database was created using "
+								  "locale \"%s\" version %s, "
+								  "but no ICU library with a matching collator is available.",
+								  iculocstr, collversion),
+						errhint("Install a version of ICU that provides locale \"%s\" "
+								"version %s, or rebuild all objects "
+								"affected by this collation and run "
+								"ALTER COLLATION %s REFRESH VERSION.",
+								iculocstr, collversion,
+								quote_qualified_identifier(get_namespace_name(collform->collnamespace),
+														   NameStr(collform->collname))));
+			}
 		}
 
-		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
-								&isnull);
-		if (!isnull)
+		if (collform->collprovider == COLLPROVIDER_LIBC && collversion)
 		{
 			char	   *actual_versionstr;
-			char	   *collversionstr;
+			char	   *locale;
 
-			collversionstr = TextDatumGetCString(datum);
-
-			datum = SysCacheGetAttr(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate, &isnull);
 			Assert(!isnull);
+			locale = TextDatumGetCString(datum);
 
-			actual_versionstr = get_collation_actual_version(collform->collprovider,
-															 TextDatumGetCString(datum));
+			actual_versionstr = get_collation_actual_version(collform->collprovider, locale);
 			if (!actual_versionstr)
 			{
 				/*
@@ -1613,18 +2171,20 @@ pg_newlocale_from_collation(Oid collid)
 								NameStr(collform->collname))));
 			}
 
-			if (strcmp(actual_versionstr, collversionstr) != 0)
+			if (strcmp(actual_versionstr, collversion) != 0)
+			{
 				ereport(WARNING,
 						(errmsg("collation \"%s\" has version mismatch",
 								NameStr(collform->collname)),
 						 errdetail("The collation in the database was created using version %s, "
 								   "but the operating system provides version %s.",
-								   collversionstr, actual_versionstr),
+								   collversion, actual_versionstr),
 						 errhint("Rebuild all objects affected by this collation and run "
 								 "ALTER COLLATION %s REFRESH VERSION, "
 								 "or build PostgreSQL with the right library version.",
 								 quote_qualified_identifier(get_namespace_name(collform->collnamespace),
 															NameStr(collform->collname)))));
+			}
 		}
 
 		ReleaseSysCache(tp);
@@ -1651,21 +2211,27 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 #ifdef USE_ICU
 	if (collprovider == COLLPROVIDER_ICU)
 	{
+		pg_icu_library *lib;
 		UCollator  *collator;
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
+		/*
+		 * Use the default library, but other versions might also be active
+		 * and can be seen with pg_icu_collation_versions().
+		 */
+		lib = get_default_icu_library();
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = lib->open(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, lib->errorName(status))));
+		lib->getCollatorVersion(collator, versioninfo);
+		lib->close(collator);
 
-		u_versionToString(versioninfo, buf);
+		lib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -1731,8 +2297,110 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+Datum
+pg_icu_library_versions(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ICU
+#define PG_ICU_AVAILABLE_ICU_LIBRARIES_COLS 3
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum		values[PG_ICU_AVAILABLE_ICU_LIBRARIES_COLS];
+	bool		nulls[PG_ICU_AVAILABLE_ICU_LIBRARIES_COLS];
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	for (pg_icu_library *lib = get_icu_library_list(); lib; lib = lib->next)
+	{
+		UErrorCode	status;
+		UVersionInfo version_info;
+		char		version_string[U_MAX_VERSION_STRING_LENGTH];
+
+		lib->getICUVersion(version_info);
+		lib->versionToString(version_info, version_string);
+		values[0] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[0] = false;
+
+		lib->getUnicodeVersion(version_info);
+		lib->versionToString(version_info, version_string);
+		values[1] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[1] = false;
+
+		status = U_ZERO_ERROR;
+		lib->getCLDRVersion(version_info, &status);
+		if (U_SUCCESS(status))
+		{
+			lib->versionToString(version_info, version_string);
+			values[2] = PointerGetDatum(cstring_to_text(version_string));
+			nulls[2] = false;
+		}
+		else
+		{
+			nulls[2] = true;
+		}
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
+#endif
+
+	return (Datum) 0;
+}
+
+Datum
+pg_icu_collation_versions(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ICU
+#define PG_ICU_AVAILABLE_ICU_LIBRARIES_COLS 3
+	const char *locale = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum		values[PG_ICU_AVAILABLE_ICU_LIBRARIES_COLS];
+	bool		nulls[PG_ICU_AVAILABLE_ICU_LIBRARIES_COLS];
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	for (pg_icu_library *lib = get_icu_library_list(); lib; lib = lib->next)
+	{
+		UErrorCode	status;
+		UCollator  *collator;
+		UVersionInfo version_info;
+		char		version_string[U_MAX_VERSION_STRING_LENGTH];
+
+		status = U_ZERO_ERROR;
+		collator = lib->open(locale, &status);
+		if (!collator)
+		{
+			if (U_FAILURE(status))
+				ereport(WARNING,
+						(errmsg("could not open collator for locale \"%s\" from ICU %d.%d: %s",
+								locale,
+								lib->major_version,
+								lib->minor_version,
+								lib->errorName(status))));
+			continue;
+		}
+
+		lib->getICUVersion(version_info);
+		lib->versionToString(version_info, version_string);
+		values[0] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[0] = false;
+
+		lib->getUCAVersion(collator, version_info);
+		lib->versionToString(version_info, version_string);
+		values[1] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[1] = false;
+
+		lib->getCollatorVersion(collator, version_info);
+		lib->versionToString(version_info, version_string);
+		values[2] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[2] = false;
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
+#endif
+
+	return (Datum) 0;
+}
 
 #ifdef USE_ICU
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -1855,9 +2523,10 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  * ucol_open(), so this is only necessary for emulating this behavior on older
  * versions.
  */
-pg_attribute_unused()
 static void
-icu_set_collation_attributes(UCollator *collator, const char *loc)
+icu_set_collation_attributes(pg_icu_library *lib,
+							 UCollator *collator,
+							 const char *loc)
 {
 	char	   *str = asc_tolower(loc, strlen(loc));
 
@@ -1886,6 +2555,8 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 
 			/*
 			 * See attribute name and value lists in ICU i18n/coll.cpp
+			 *
+			 * XXX Are these enumerator values stable across releases?
 			 */
 			if (strcmp(name, "colstrength") == 0)
 				uattr = UCOL_STRENGTH;
@@ -1931,7 +2602,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 				status = U_ILLEGAL_ARGUMENT_ERROR;
 
 			if (status == U_ZERO_ERROR)
-				ucol_setAttribute(collator, uattr, uvalue, &status);
+				lib->setAttribute(collator, uattr, uvalue, &status);
 
 			/*
 			 * Pretend the error came from ucol_open(), for consistent error
@@ -1940,7 +2611,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			if (U_FAILURE(status))
 				ereport(ERROR,
 						(errmsg("could not open collator for locale \"%s\": %s",
-								loc, u_errorName(status))));
+								loc, lib->errorName(status))));
 		}
 	}
 }
@@ -1954,19 +2625,18 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
+	pg_icu_library *lib;
 	UCollator  *collator;
 	UErrorCode	status;
 
+	lib = get_default_icu_library();
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = lib->open(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+						icu_locale, lib->errorName(status))));
+	lib->close(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 68e2e6f7a7..e0c86870e0 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1026,11 +1026,11 @@ hashbpchar(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, keydata, keylen);
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -1087,11 +1087,11 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+													 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+											 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c5e7ee7ca2..cf891a5654 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1667,13 +1667,14 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					UErrorCode	status;
 
 					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
+					result = PG_ICU_LIB(mylocale)->strcollUTF8(PG_ICU_COL(mylocale),
+															   arg1, len1,
+															   arg2, len2,
+															   &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
+								(errmsg("collation failed: %s",
+										PG_ICU_LIB(mylocale)->errorName(status))));
 				}
 				else
 #endif
@@ -1686,9 +1687,9 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
 					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
 
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
+					result = PG_ICU_LIB(mylocale)->strcoll(PG_ICU_COL(mylocale),
+														   uchar1, ulen1,
+														   uchar2, ulen2);
 
 					pfree(uchar1);
 					pfree(uchar2);
@@ -2388,13 +2389,14 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				UErrorCode	status;
 
 				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
+				result = PG_ICU_LIB(sss->locale)->strcollUTF8(PG_ICU_COL(sss->locale),
+															  a1p, len1,
+															  a2p, len2,
+															  &status);
 				if (U_FAILURE(status))
 					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
+							(errmsg("collation failed: %s",
+									PG_ICU_LIB(sss->locale)->errorName(status))));
 			}
 			else
 #endif
@@ -2407,9 +2409,9 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
 				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
 
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
+				result = PG_ICU_LIB(sss->locale)->strcoll(PG_ICU_COL(sss->locale),
+														  uchar1, ulen1,
+														  uchar2, ulen2);
 
 				pfree(uchar1);
 				pfree(uchar2);
@@ -2569,24 +2571,24 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 					uint32_t	state[2];
 					UErrorCode	status;
 
-					uiter_setUTF8(&iter, sss->buf1, len);
+					PG_ICU_LIB(sss->locale)->setUTF8(&iter, sss->buf1, len);
 					state[0] = state[1] = 0;	/* won't need that again */
 					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
+					bsize = PG_ICU_LIB(sss->locale)->nextSortKeyPart(PG_ICU_COL(sss->locale),
+																	 &iter,
+																	 state,
+																	 (uint8_t *) sss->buf2,
+																	 Min(sizeof(Datum), sss->buflen2),
+																	 &status);
 					if (U_FAILURE(status))
 						ereport(ERROR,
 								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
+										PG_ICU_LIB(sss->locale)->errorName(status))));
 				}
 				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+					bsize = PG_ICU_LIB(sss->locale)->getSortKey(PG_ICU_COL(sss->locale),
+																uchar, ulen,
+																(uint8_t *) sss->buf2, sss->buflen2);
 			}
 			else
 #endif
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a990c833c5..236ec6d682 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -317,6 +317,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	char	   *collate;
 	char	   *ctype;
 	char	   *iculocale;
+	char	   *collversion;
 
 	/* Fetch our pg_database row normally, via syscache */
 	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
@@ -404,6 +405,9 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datctype, &isnull);
 	Assert(!isnull);
 	ctype = TextDatumGetCString(datum);
+	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
+							&isnull);
+	collversion = isnull ? NULL : TextDatumGetCString(datum);
 
 	if (pg_perm_setlocale(LC_COLLATE, collate) == NULL)
 		ereport(FATAL,
@@ -424,7 +428,20 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 		datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_daticulocale, &isnull);
 		Assert(!isnull);
 		iculocale = TextDatumGetCString(datum);
-		make_icu_collator(iculocale, &default_locale);
+		if (make_icu_collator(iculocale, collversion, &default_locale))
+		{
+			ereport(WARNING,
+					errmsg("database \"%s\" has a collation version mismatch",
+						   name),
+					errdetail("The database was created using ICU locale \"%s\" version %s, "
+							  "but no ICU library with a matching collator is available.",
+							  iculocale, collversion),
+					errhint("Install a version of ICU that provides locale \"%s\" "
+							"version %s, or rebuild all objects "
+							"in this database that use the default collation and run "
+							"ALTER DATABASE %s REFRESH COLLATION VERSION.",
+							iculocale, collversion, quote_identifier(name)));
+		}
 	}
 	else
 		iculocale = NULL;
@@ -443,32 +460,29 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	 * pg_newlocale_from_collation().  Note that here we warn instead of error
 	 * in any case, so that we don't prevent connecting.
 	 */
-	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
-							&isnull);
-	if (!isnull)
+	if (dbform->datlocprovider == COLLPROVIDER_LIBC && collversion)
 	{
 		char	   *actual_versionstr;
-		char	   *collversionstr;
-
-		collversionstr = TextDatumGetCString(datum);
 
-		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
+		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, collate);
 		if (!actual_versionstr)
 			/* should not happen */
 			elog(WARNING,
 				 "database \"%s\" has no actual collation version, but a version was recorded",
 				 name);
-		else if (strcmp(actual_versionstr, collversionstr) != 0)
+		else if (strcmp(actual_versionstr, collversion) != 0)
+		{
 			ereport(WARNING,
 					(errmsg("database \"%s\" has a collation version mismatch",
 							name),
 					 errdetail("The database was created using collation version %s, "
 							   "but the operating system provides version %s.",
-							   collversionstr, actual_versionstr),
+							   collversion, actual_versionstr),
 					 errhint("Rebuild all objects in this database that use the default collation and run "
 							 "ALTER DATABASE %s REFRESH COLLATION VERSION, "
 							 "or build PostgreSQL with the right library version.",
 							 quote_identifier(name))));
+		}
 	}
 
 	/* Make the locale settings visible as GUC variables, too */
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 349dd6a537..5fd08f7693 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3941,7 +3941,45 @@ struct config_string ConfigureNamesString[] =
 	},
 
 	{
-		{"krb_server_keyfile", PGC_SIGHUP, CONN_AUTH_AUTH,
+		{"icu_library_path", PGC_SUSET, COMPAT_OPTIONS_PREVIOUS,
+			gettext_noop("Sets the path for dynamically loadable ICU libraries."),
+			gettext_noop("If versions of ICU other than the one that "
+						 "PostgreSQL is linked against are needed, they will "
+						 "be opened from this directory.  If empty, the "
+						 "system linker search path will be used."),
+			GUC_SUPERUSER_ONLY
+		},
+		&icu_library_path,
+		"",
+		NULL, NULL, NULL
+	},
+
+	{
+		{"icu_library_versions", PGC_SUSET, COMPAT_OPTIONS_PREVIOUS,
+			gettext_noop("Sets the available ICU library versions."),
+			gettext_noop("A comma-separated list of major or major.minor ICU versions "
+						 "that will be searched for referenced collation versions.  Use * "
+						 "for all possible versions."),
+			GUC_SUPERUSER_ONLY
+		},
+		&icu_library_versions,
+		"*",
+		NULL, NULL, NULL
+	},
+
+	{
+		{"default_icu_library_version", PGC_SIGHUP, COMPAT_OPTIONS_PREVIOUS,
+			gettext_noop("Sets the ICU library version used to create new collations and databases."),
+			gettext_noop("A major or major.minor ICU version, or empty string for the linked version."),
+			GUC_SUPERUSER_ONLY
+		},
+		&default_icu_library_version,
+		"",
+		NULL, NULL, NULL
+	},
+
+	{
+		{"krb_server_keyfile", PGC_SIGHUP, CONN_AUTH_AUTH,
 			gettext_noop("Sets the location of the Kerberos server key file."),
 			NULL,
 			GUC_SUPERUSER_ONLY
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c351..93a0ad6406 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -727,6 +727,16 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#icu_library_path = ''			# path for dynamically loaded ICU
+					# libraries
+#icu_library_versions = '*'		# comma-separated list of ICU major
+					# or major.minor versions to make
+					# available, or * for all major
+					# versions that can be found
+#default_icu_library_version = ''	# version of ICU to use for new
+					# databases and collations, defaults
+					# to the linked version
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f9301b2627..607968b340 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11733,6 +11733,29 @@
   proname => 'pg_database_collation_actual_version', procost => '100',
   provolatile => 'v', prorettype => 'text', proargtypes => 'oid',
   prosrc => 'pg_database_collation_actual_version' },
+{ oid => '8888', descr => 'get available ICU library versions',
+  proname => 'pg_icu_library_versions', prorettype => 'record',
+  procost => '10',
+  prorows => '2',
+  proretset => 't',
+  provolatile => 'v',
+  proargtypes => '',
+  proallargtypes => '{text,text,text}',
+  proargmodes => '{o,o,o}',
+  proargnames => '{icu_version,unicode_version,cldr_version}',
+  prosrc => 'pg_icu_library_versions' },
+{ oid => '8889', descr => 'get available ICU collation versions',
+  proname => 'pg_icu_collation_versions', prorettype => 'record',
+  procost => '10',
+  prorows => '2',
+  proretset => 't',
+  provolatile => 'v',
+  proargtypes => 'text',
+  proallargtypes => '{text,text,text,text}',
+  proargmodes => '{i,o,o,o}',
+  proargnames => '{locale,icu_version,uca_version,collator_version}',
+  prosrc => 'pg_icu_collation_versions' },
+
 
 # system management/monitoring related functions
 { oid => '3353', descr => 'list files in the log directory',
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index a875942123..554b335df9 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -17,6 +17,7 @@
 #endif
 #ifdef USE_ICU
 #include <unicode/ucol.h>
+#include <unicode/ubrk.h>
 #endif
 
 #ifdef USE_ICU
@@ -40,6 +41,9 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT char *icu_library_path;
+extern PGDLLIMPORT char *icu_library_versions;
+extern PGDLLIMPORT char *default_icu_library_version;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -63,6 +67,78 @@ extern struct lconv *PGLC_localeconv(void);
 
 extern void cache_locale_time(void);
 
+#ifdef USE_ICU
+
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	int			minor_version;
+	void		(*getICUVersion) (UVersionInfo info);
+	void		(*getUnicodeVersion) (UVersionInfo info);
+	void		(*getCLDRVersion) (UVersionInfo info, UErrorCode *status);
+	UCollator  *(*open) (const char *loc, UErrorCode *status);
+	void		(*close) (UCollator *coll);
+	void		(*getCollatorVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*getUCAVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+	UCollationResult (*strcoll) (const UCollator *coll,
+								 const UChar *source,
+								 int32_t sourceLength,
+								 const UChar *target,
+								 int32_t targetLength);
+	UCollationResult (*strcollUTF8) (const UCollator *coll,
+									 const char *source,
+									 int32_t sourceLength,
+									 const char *target,
+									 int32_t targetLength,
+									 UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	void		(*setAttribute) (UCollator *coll,
+								 UColAttribute attr,
+								 UColAttributeValue value,
+								 UErrorCode *status);
+	struct pg_icu_library *next;
+} pg_icu_library;
+
+#endif
 
 /*
  * We define our own wrapper around locale_t so we can keep the same
@@ -84,17 +160,24 @@ struct pg_locale_struct
 		{
 			const char *locale;
 			UCollator  *ucol;
+			pg_icu_library *lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
 	}			info;
 };
 
+#ifdef USE_ICU
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+#define PG_ICU_COL(x) ((x)->info.icu.ucol)
+#endif
+
 typedef struct pg_locale_struct *pg_locale_t;
 
 extern PGDLLIMPORT struct pg_locale_struct default_locale;
 
-extern void make_icu_collator(const char *iculocstr,
+extern bool make_icu_collator(const char *iculocstr,
+							  const char *collversion,
 							  struct pg_locale_struct *resultp);
 
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
diff --git a/src/test/icu/meson.build b/src/test/icu/meson.build
index 5a4f53f37f..ac2672190e 100644
--- a/src/test/icu/meson.build
+++ b/src/test/icu/meson.build
@@ -5,6 +5,7 @@ tests += {
   'tap': {
     'tests': [
       't/010_database.pl',
+      't/020_multiversion.pl',
     ],
     'env': {'with_icu': icu.found() ? 'yes' : 'no'},
   },
diff --git a/src/test/icu/t/020_multiversion.pl b/src/test/icu/t/020_multiversion.pl
new file mode 100644
index 0000000000..c04df4c65d
--- /dev/null
+++ b/src/test/icu/t/020_multiversion.pl
@@ -0,0 +1,274 @@
+# Copyright (c) 2022, PostgreSQL Global Development Group
+#
+# If one or more extra ICU versions is installed in the standard system library
+# search path, this test will detect them and run.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{with_icu} ne 'yes')
+{
+	plan skip_all => 'ICU not supported by this build';
+}
+
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init;
+$node1->start;
+
+# Check which ICU versions are installed.
+my $highest_version = $node1->safe_psql('postgres', 'select max(icu_version::decimal) from pg_icu_library_versions()');
+my $lowest_version = $node1->safe_psql('postgres', 'select min(icu_version::decimal) from pg_icu_library_versions()');
+my $highest_major_version = int($highest_version);
+my $lowest_major_version = int($lowest_version);
+
+if ($highest_major_version == $lowest_major_version)
+{
+	$node1->stop;
+	plan skip_all => 'no extra ICU library versions found';
+}
+
+sub set_default_icu_library_version
+{
+	my $icu_version = shift;
+	$node1->safe_psql('postgres', "alter system set default_icu_library_version = '$icu_version'; select pg_reload_conf()");
+}
+
+sub set_icu_library_versions
+{
+	my $icu_versions = shift;
+	$node1->safe_psql('postgres', "alter system set icu_library_versions = '$icu_versions'");
+	$node1->restart;
+}
+
+my $ret;
+my $stderr;
+
+# === DATABASE objects ===
+
+# ===== Scenario 1: user creates database with all default settings
+
+$node1->safe_psql('postgres', "create database db2 locale_provider = icu template = template0 icu_locale = 'en'");
+
+# No warning when logging into this database.
+$ret = $node1->psql('db2', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# ===== Scenario 2: user wants to use an old library
+
+# Create a database using the older library by changing the default.  This
+# might be done for compatibility with some other system, but it also simulates
+# a database that was created with all default settings when the binary was
+# linked against the older version.
+set_default_icu_library_version($lowest_major_version);
+$node1->safe_psql('postgres', "create database db3 locale_provider = icu template = template0 icu_locale = 'en'");
+
+isnt($node1->safe_psql('postgres', "select datcollversion from pg_database where datname = 'db2'"),
+     $node1->safe_psql('postgres', "select datcollversion from pg_database where datname = 'db3'"),
+     'db2 and db3 should have different datcollversion');
+
+# ===== Scenario 3: user has the old library available, is happy to keep using it
+
+# Unset the default ICU library version (meaning use the linked version for
+# newly created databases).  No warning, because we can still find that older
+# version via dlopen().  User can happily go on using that old version in this
+# database for the rest of time.
+set_default_icu_library_version("");
+$ret = $node1->psql('db3', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# ===== Scenario 4: user doesn't have the old library, installs after warnings
+
+# Hide the old library version.  This simulates a system that doesn't have that
+# version installed yet, by making it unavailable.  We get a warning.
+set_icu_library_versions("$highest_major_version");
+$ret = $node1->psql('db3', "select", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/WARNING:  database "db3" has a collation version mismatch/, "warning for incorrect datcollversion");
+like($stderr, qr/HINT:  Install a version of ICU that provides/, "warning suggests installing another ICU version");
+
+# Make the old version available again, this time explicitly (whereas before it
+# worked because the default is * which would find it automatically).
+set_icu_library_versions("$lowest_major_version,$highest_major_version");
+$ret = $node1->psql('db3', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# It also works if you use a major.minor version explicitly.
+set_icu_library_versions("$lowest_version,$highest_major_version");
+$ret = $node1->psql('db3', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# Or *, the default value that we started with.
+set_icu_library_versions("*");
+$ret = $node1->psql('db3', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# ===== Scenario 5: user doesn't have the old library, rebuilds after warnings
+
+# Hide the old library version again, and we get the warning again.
+set_icu_library_versions("$highest_major_version");
+$ret = $node1->psql('db3', "select", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/WARNING:  database "db3" has a collation version mismatch/, "warning for incorrect datcollversion");
+like($stderr, qr/HINT:  Install a version of ICU that provides/, "warning suggests installing another ICU version");
+
+# If we don't want to install a new library, we have the option of clobbering
+# the version.  It's the administrator's job to rebuild any database objects
+# that depend on the collation (most interestingly indexes) before doing so.
+# In this scenario, the REFRESH command can be run before or *after* rebuilding
+# indexes, because either way we're already using the default ICU library (due
+# to failure to find the named version).
+$ret = $node1->psql('postgres', "alter database db3 refresh collation version", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/NOTICE:  changing version/, "version changes");
+
+# Now no warning.
+$ret = $node1->psql('db3', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning after refresh");
+
+# ===== Scenario 6: user has the old library, but eventually decides to rebuild/upgrade
+
+# Make a new database with the old version active
+set_default_icu_library_version($lowest_major_version);
+$node1->safe_psql('postgres', "create database db4 locale_provider = icu template = template0 icu_locale = 'en'");
+
+# No warning, it just loads the old version.
+$ret = $node1->psql('db4', "select", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning after refresh");
+my $old_datcollversion = $node1->safe_psql('postgres', "select datcollversion from pg_database where datname = 'db4'");
+
+# The user would now like to upgrade to the new library.  Presumably people
+# will want to do this eventually to avoid running very old unmaintained copies
+# of ICU.  Unlike scenario 3, here it's actually a requirement to REFRESH
+# *before* doing all the rebuilds of indexes etc, which may be a little
+# confusing (not shown here).  REFRESH is necessary to change datcollversion,
+# which is required to make us start opening the newer library.
+#
+# XXX Currently you also need to reconnect all sessions too, because the
+# default locale is cached and now out of date.
+set_default_icu_library_version("");
+$ret = $node1->psql('postgres', "alter database db4 refresh collation version", stderr => \$stderr);
+my $new_datcollversion = $node1->safe_psql('postgres', "select datcollversion from pg_database where datname = 'db4'");
+
+isnt($old_datcollversion, $new_datcollversion, "datcollversion changed");
+
+
+# === COLLATION objects ===
+
+# The same scenarios, this time with COLLATIONs.
+
+# ===== Scenario 1: user creates collation with all default settings
+
+set_default_icu_library_version("");
+set_icu_library_versions("*");
+$node1->safe_psql('postgres', "create collation c1 (provider = icu, locale = 'en')");
+
+# No warning when using it.
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c1", stderr => \$stderr);
+is($ret, 0, "can use collation");
+unlike($stderr, qr/WARNING/, "no warning for default");
+
+# ===== Scenario 2: user wants to use an old library
+
+# Simulates a collation in a database that migrated from an older binary, or a
+# collation set up explicitly to match some other system.
+set_default_icu_library_version($lowest_major_version);
+$node1->safe_psql('postgres', "create collation c2 (provider = icu, locale = 'en')");
+
+isnt($node1->safe_psql('postgres', "select collversion from pg_collation where collname = 'c1'"),
+     $node1->safe_psql('postgres', "select collversion from pg_collation where collname = 'c2'"),
+     'c1 and c2 should have different collversion');
+
+# ===== Scenario 3: user has the old library available, is happy to keep using it
+
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c2", stderr => \$stderr);
+is($ret, 0, "can use collation");
+unlike($stderr, qr/WARNING/, "no warning when using old library collation");
+
+# ===== Scenario 4: user doesn't have the old library, installs after warnings
+
+# Hide the old library version.
+set_default_icu_library_version("");
+set_icu_library_versions("$highest_major_version");
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c2", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/WARNING:  collation "c2" version mismatch/, "warning for incorrect collversion");
+like($stderr, qr/HINT:  Install a version of ICU that provides/, "warning suggests installing another ICU version");
+
+# Make the old version available again.
+set_icu_library_versions("$lowest_major_version,$highest_major_version");
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c2", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# It also works if you use a major.minor version explicitly.
+set_icu_library_versions("$lowest_version,$highest_major_version");
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c2", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# Or *, the default value that we started with.
+set_icu_library_versions("*");
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c2", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# ===== Scenario 5: user doesn't have the old library, rebuilds after warnings
+
+# Hide the old library version again.
+set_icu_library_versions("$highest_major_version");
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c2", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/WARNING:  collation "c2" version mismatch/, "warning for incorrect collversion");
+like($stderr, qr/HINT:  Install a version of ICU that provides/, "warning suggests installing another ICU version");
+
+# Rebuild things, and refresh.
+$ret = $node1->psql('postgres', "alter collation c2 refresh version", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/NOTICE:  changing version/, "version changes");
+
+# Now no warning.
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c2", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+# ===== Scenario 6: user has the old library, but eventually decides to rebuild/upgrade
+
+set_default_icu_library_version($lowest_major_version);
+$node1->safe_psql('postgres', "create collation c3 (provider = icu, locale = 'en')");
+
+# No warning.
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c3", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+my $old_collversion = $node1->safe_psql('postgres', "select collversion from pg_collation where collname = 'c3'");
+
+# Rebuild things, and refresh.  As with database scenario 6, we need to refresh
+# *before* rebuilding dependent objects (not shown here).
+set_default_icu_library_version("");
+$ret = $node1->psql('postgres', "alter collation c3 refresh version", stderr => \$stderr);
+is($ret, 0, "success");
+like($stderr, qr/NOTICE:  changing version/, "version changes");
+
+# No warning.
+$ret = $node1->psql('postgres', "select 'x' < 'y' collate c3", stderr => \$stderr);
+is($ret, 0, "success");
+unlike($stderr, qr/WARNING/, "no warning");
+
+my $new_collversion = $node1->safe_psql('postgres', "select collversion from pg_collation where collname = 'c3'");
+
+isnt($old_collversion, $new_collversion, "collversion changed");
+
+$node1->stop;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2f5802195d..50d9558cf4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1101,6 +1101,7 @@ HeapTupleTableSlot
 HistControl
 HotStandbyState
 I32
+ICU_Convert_BI_Func
 ICU_Convert_Func
 ID
 INFIX
@@ -2852,10 +2853,12 @@ TypeName
 U
 U32
 U8
+UBreakIterator
 UChar
 UCharIterator
 UColAttribute
 UColAttributeValue
+UCollationResult
 UCollator
 UConverter
 UErrorCode
@@ -3482,6 +3485,7 @@ pg_funcptr_t
 pg_gssinfo
 pg_hmac_ctx
 pg_hmac_errno
+pg_icu_library
 pg_int64
 pg_local_to_utf_combined
 pg_locale_t
-- 
2.38.1

v8-0002-ci-XXX-install-ICU63-on-debian.patch (text/x-patch; charset=US-ASCII)

From aa9ac4f827f83abd4efe7f773efa2e2f45ad7640 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 26 Nov 2022 14:39:18 +1300
Subject: [PATCH v8 2/2] ci: XXX install ICU63 on debian

This is not a good way to add the package, just doing this temporarily
as a demonstration via cfbot.
---
 .cirrus.yml | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/.cirrus.yml b/.cirrus.yml
index f31923333e..8c1fb63cad 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -296,6 +296,10 @@ task:
   setup_additional_packages_script: |
     #apt-get update
     #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+    # this is debian 11 (bullseye) but we can install ICU 63 from debian 10 (buster)
+    curl -O http://ftp.debian.org/debian/pool/main/i/icu/libicu63_63.1-6+deb10u3_amd64.deb
+    dpkg -i libicu63_63.1-6+deb10u3_amd64.deb
+
 
   matrix:
     - name: Linux - Debian Bullseye - Autoconf
-- 
2.38.1

#99

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Thomas Munro (#98)

Re: Collation version tracking for macOS

On Sat, Nov 26, 2022 at 6:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:

This is just a first cut, but enough to try out and see if we like it,
what needs to be improved, what edge cases we haven't thought about
etc. Let me know what you think.

BTW one problem to highlight (mentioned but buried in the test
comments), is that REFRESH VERSION doesn't affect other sessions or
even the current session. You have to log out and back in again to
pick up the new version. Obviously that's not good enough, but fixing
that involves making it transactional, I think. If you abort, we have
to go back to using the old version, if you commit you keep the new
version and we might also consider telling other backends to start
using the new version -- or something like that. I think that's just
a Small Matter of Programming, but a little bit finickity and I need
to take a break for a bit and go work on bugs elsewhere, hence v8
didn't address that yet.

#100

Robert Haas

robertmhaas@gmail.com

about 3 years ago

In reply to: Thomas Munro (#94)

Re: Collation version tracking for macOS

On Wed, Nov 23, 2022 at 12:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:

OK. Time for a new list of the various models we've discussed so far:

1. search-by-collversion: We introduce no new "library version"
concept to COLLATION and DATABASE object and little or no new syntax.

2. lib-version-in-providers: We introduce a separate provider value
for each ICU version, for example ICU63, plus an unversioned ICU like
today.

3. lib-version-in-attributes: We introduce daticuversion (alongside
datcollversion) and collicuversion (alongside collversion). Similar
to the above, but it's a separate property and the provider is always
ICU. New syntax for CREATE/ALTER COLLATION/DATABASE to set and change
ICU_VERSION.

4. lib-version-in-locale: "63:en" from earlier versions. That was
mostly a strawman proposal to avoid getting bogged down in
syntax/catalogue/model change discussions while trying to prove that
dlopen would even work. It doesn't sound like anyone really likes
this.

5. lib-version-in-collversion: We didn't explicitly discuss this
before, but you hinted at it: we could just use u_getVersion() in
[dat]collversion.

I'd like to vote against #3 at least in the form that's described
here. If we had three more libraries providing collations, it's likely
that they would need versioning, too. So if we add an explicit notion
of provider version, then it ought not to be specific to libicu.

I think it's OK to decide that different library versions are
different providers (your option #2), or that they are the same
provider but give rise to different collations (your option #4), or
that there can be multiple versions of each collation which are
distinguished by some additional provider version field (your #3 made
more generic).

I don't really understand #1 or #5 well enough to have an educated
opinion, but I do think that #1 seems a bit magical. It hopes that the
combination of a collation name and a datcollversion will be
sufficient to find exactly one matching collation in a list of provided
libraries. The advantage of that, as I understand it, is that if you
do something to your system that causes the number of matches to go
from one to zero, you can just throw another library on the pile and
get the number back up to one. Woohoo! But there's a part of me that
worries: what if the number goes up to two, and they're not all the
same? Probably that's something that shouldn't happen, but if it does
then I think there's kind of no way to fix it. With the other options,
if there's some way to jigger the catalog state to match what you want
to happen, you can always repair the situation somehow, because the
library to be used for each collation is explicitly specified in some
way, and you just have to get it to match what you want to have
happen.

I don't know too much about this, though, so I might have it all wrong.

--
Robert Haas
EDB: http://www.enterprisedb.com

#101

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#98)

Re: Collation version tracking for macOS

On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote:

Here's the first iteration.

I will send a full review shortly, but I encountered an ICU bug along
the way, which caused me some confusion for a bit. I'll skip past the
various levels of confusion I had (burned a couple hours), and get
right to the repro:

Install the latest release of all major versions 50-69, and compile
postgres against 70. You'll get:

=# select * from pg_icu_collation_versions('en_US') order by icu_version;
 icu_version | uca_version | collator_version
-------------+-------------+------------------
 50.2        | 6.2         | 58.0.6.50
 51.3        | 6.2         | 58.0.6.50
 52.2        | 6.2         | 58.0.6.50
 53.2        | 6.3         | 137.51
 54.2        | 7.0         | 137.56
 55.2        | 7.0         | 153.56
 56.2        | 8.0         | 153.64
 57.2        | 8.0         | 153.64
 58.3        | 9.0         | 153.72
 59.2        | 9.0         | 153.72
 60.3        | 10.0        | 153.80
 61.2        | 10.0        | 153.80
 62.2        | 11.0        | 153.88
 63.2        | 11.0        | 153.88
 64.2        | 12.1        | 153.97
 65.1        | 12.1        | 153.97
 66.1        | 13.0        | 153.14
 67.1        | 13.0        | 153.14
 68.2        | 13.0        | 153.14
 69.1        | 13.0        | 153.14
 70.1        | 14.0        | 153.112
(21 rows)

This is good information, because it tells us that major library
versions change more often than collation versions, empirically
speaking.

But did you notice that the version went backwards from 65.1 -> 66.1?
Well, actually, it didn't. The version of that collation in 66.1 went
from 153.97 -> 153.104. But there's a bug in versionToString() that
does the decimal output incorrectly when there's a '0' digit between
the hundreds and the ones place. I'll see about reporting that, but I
thought I'd mention it here because it could have consequences, as we
are storing the strings :-(

The bug is still present in 70.1, but it's masked because it went to
.112.

Incidentally, this answers our other question about whether the
collation version can change in a minor version update. Perhaps not,
but if they fix this bug and backport it, then the version *string*
will change in a minor update. Ugh.
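
To make the misprint easier to picture, here is a minimal C sketch of one way a per-component decimal formatter could drop a zero tens digit. It is an illustration of the symptom described above and is not taken from the ICU sources; `format_component_buggy` and `format_component_correct` are hypothetical names.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for a version-component formatter; NOT ICU's code.
 * Writes the decimal form of "field" into "out" and returns its length.
 * The tens digit is only written when the remainder is >= 10, so a value
 * such as 104 loses its middle zero.
 */
static int
format_component_buggy(unsigned int field, char *out)
{
	char	   *p = out;

	if (field >= 100)
	{
		*p++ = (char) ('0' + field / 100);
		field %= 100;
	}
	if (field >= 10)			/* bug: skipped when the tens digit is 0 */
	{
		*p++ = (char) ('0' + field / 10);
		field %= 10;
	}
	*p++ = (char) ('0' + field);
	*p = '\0';
	return (int) (p - out);
}

/* Correct formatting, for comparison. */
static int
format_component_correct(unsigned int field, char *out)
{
	return sprintf(out, "%u", field);
}
```

Under this sketch, 104 is written as "14" while 97 and 112 come out unchanged, which is consistent with the 153.97 -> 153.14 -> 153.112 sequence in the table.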

Regards,
Jeff Davis

#102

Robert Haas

robertmhaas@gmail.com

about 3 years ago

In reply to: Jeff Davis (#101)

Re: Collation version tracking for macOS

On Mon, Nov 28, 2022 at 9:55 PM Jeff Davis <pgsql@j-davis.com> wrote:

But did you notice that the version went backwards from 65.1 -> 66.1?
Well, actually, it didn't. The version of that collation in 66.1 went
from 153.97 -> 153.104. But there's a bug in versionToString() that
does the decimal output incorrectly when there's a '0' digit between
the hundreds and the ones place. I'll see about reporting that, but I
thought I'd mention it here because it could have consequences, as we
are storing the strings :-(

The bug is still present in 70.1, but it's masked because it went to
.112.

Incidentally, this answers our other question about whether the
collation version can change in a minor version update. Perhaps not,
but if they fix this bug and backport it, then the version *string*
will change in a minor update. Ugh.

That is ... astonishingly bad.

--
Robert Haas
EDB: http://www.enterprisedb.com

#103

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Robert Haas (#102)

Re: Collation version tracking for macOS

On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote:

That is ... astonishingly bad.

https://unicode-org.atlassian.net/browse/CLDR-16175

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#104

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#101)

Re: Collation version tracking for macOS

On Tue, Nov 29, 2022 at 3:55 PM Jeff Davis <pgsql@j-davis.com> wrote:

=# select * from pg_icu_collation_versions('en_US') order by icu_version;
 icu_version | uca_version | collator_version
-------------+-------------+------------------
 50.2        | 6.2         | 58.0.6.50
 51.3        | 6.2         | 58.0.6.50
 52.2        | 6.2         | 58.0.6.50
 53.2        | 6.3         | 137.51
 54.2        | 7.0         | 137.56
 55.2        | 7.0         | 153.56
 56.2        | 8.0         | 153.64
 57.2        | 8.0         | 153.64
 58.3        | 9.0         | 153.72
 59.2        | 9.0         | 153.72
 60.3        | 10.0        | 153.80
 61.2        | 10.0        | 153.80
 62.2        | 11.0        | 153.88
 63.2        | 11.0        | 153.88
 64.2        | 12.1        | 153.97
 65.1        | 12.1        | 153.97
 66.1        | 13.0        | 153.14
 67.1        | 13.0        | 153.14
 68.2        | 13.0        | 153.14
 69.1        | 13.0        | 153.14
 70.1        | 14.0        | 153.112
(21 rows)

This is good information, because it tells us that major library
versions change more often than collation versions, empirically
speaking.

Wow, nice discovery about 104 -> 14. Yeah, I imagine we'll want some
kind of band-aid to tolerate that exact screwup and avoid spurious
warnings.
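
One possible shape for such a band-aid, as a rough C sketch only: when comparing a stored version component against the real numeric value, also accept the dropped-zero misprint. `component_matches` is a hypothetical helper and is not part of any posted patch.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Does a stored version component (text) match the real numeric value,
 * allowing for the dropped-zero misprint (e.g. 104 recorded as "14")?
 * Hypothetical helper for illustration.
 */
static bool
component_matches(const char *stored, int actual)
{
	int			recorded = atoi(stored);

	if (recorded == actual)
		return true;

	/* The misprint only arises for values >= 100 whose tens digit is 0. */
	if (actual >= 100 && (actual / 10) % 10 == 0)
		return recorded == (actual / 100) * 10 + actual % 10;

	return false;
}
```

The obvious cost is ambiguity: under this rule a stored "14" is accepted for both a real 14 and a real 104, so it trades a spurious warning for a slightly weaker check.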

Bugs aside, that's quite a revealing table in other ways. We can see:

* The version scheme changed completely in ICU 53. This corresponds
to a major rewrite of the collation code, I see[1].

* The first component seems to be (UCOL_RUNTIME_VERSION << 4) + 9.
UCOL_RUNTIME_VERSION is in their uvernum.h, currently 9, was 8, bumped
between 54 and 55 (I see this in their commit log), corresponding to
the two possible numbers 137 and 153 that we see there. I don't know
where the final 9 term is coming from but it looks stable since the v2
collation rewrite landed.

* The second component seems to be uca_version_major * 8 +
uca_version_minor (that's the Unicode Collation Algorithm version, and
so far always matches the Unicode version, visible in the output of
the other function).

* The values you showed for English don't have a third component, but
if you try some other locales like 'zh' you'll see the CLDR major
version in third position. So I guess some locales depend on CLDR
data and others don't.

TL;DR it *looks* like the set of ingredients for the version string is:

* UCOL_RUNTIME_VERSION (rarely changes)
* UCA/Unicode major.minor version
* sometimes CLDR major version, not sure when
* 9
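
As a quick arithmetic check of the guesses above, the following C sketch recomputes the first two components from their supposed ingredients. The formulas are inferred from the table in this thread, not taken from ICU documentation, and the function names are made up for illustration.

```c
/*
 * First component of the collator version, as guessed above:
 * (UCOL_RUNTIME_VERSION << 4) + 9.
 */
static int
collator_first_component(int ucol_runtime_version)
{
	return (ucol_runtime_version << 4) + 9;
}

/*
 * Second component, as guessed above:
 * uca_version_major * 8 + uca_version_minor.
 */
static int
collator_second_component(int uca_major, int uca_minor)
{
	return uca_major * 8 + uca_minor;
}
```

With runtime versions 8 and 9 this gives 137 and 153, and UCA 6.3, 12.1, 13.0 and 14.0 give 51, 97, 104 and 112, which lines up with the table once the 104 -> 14 misprint is taken into account.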

[1]: https://icu.unicode.org/design/collation/v2

#105

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Robert Haas (#100)

Re: Collation version tracking for macOS

On Mon, 2022-11-28 at 14:11 -0500, Robert Haas wrote:

I don't really understand #1 or #5 well enough to have an educated
opinion, but I do think that #1 seems a bit magical. It hopes that the
combination of a collation name and a datcollversion will be
sufficient to find exactly one matching collation in a list of provided
libraries. The advantage of that, as I understand it, is that if you
do something to your system that causes the number of matches to go
from one to zero, you can just throw another library on the pile and
get the number back up to one. Woohoo! But there's a part of me that
worries: what if the number goes up to two, and they're not all the
same? Probably that's something that shouldn't happen, but if it does
then I think there's kind of no way to fix it. With the other options,
if there's some way to jigger the catalog state to match what you want
to happen, you can always repair the situation somehow, because the
library to be used for each collation is explicitly specified in some
way, and you just have to get it to match what you want to have
happen.

Not necessarily: #2-4 (at least as implemented in v7) can only load one
major version at a time, so can't specify minor versions:
/messages/by-id/9f8e9b5a3352478d4cf7d6c0a5dd7e82496be4b6.camel@j-davis.com

With #1, you can provide control over the search order to find the
symbol you want. Granted, if you want to specify that different
collations look in different libraries for the same version, then it
won't work, because the search order is global -- is that what you're
worried about? If so, I think we need to compare it against the
downsides of #2-4, which in my opinion are more serious.

The first thing to sort out with options #2-4 is: what about minor
versions? V7 took the approach that only the major version matters.
That means that if you want to select a specific minor version, then
you are out of luck, because only one major at a time can be loaded,
globally. But paying attention to minor versions seems like a mess --
we'd need even more magical fallbacks that try later minor versions or
something.

Second, there is weirdness in the common case that a collation version
doesn't change between versions. Let's say you have a collation
"mycoll" with locale "en_US" and it's pointed at built-in library
version 64, with collation version 153.97. GUC
default_icu_library_version is set to 63. Then you upgrade the system
and ICU gets updated from 64 -> 65. Now, it can't find version 64 to
load, so it falls back to 63 (which has the wrong version 153.88), even
though 65 is just fine because it still offers that locale with version
153.97. (A similar problem exists when you remove a version of ICU from
icu_library_path, and another version suffices for all of your
collations.)

Thirdly, as I said earlier, it's just hard on the user to try to sort
out two different versions modeled in the database. Understanding
encodings and collations is hard enough, and then we introduce *two*
versions on top of that.

Fourth, I don't see what the point of ucol_getVersion() is in schemes
#2-4. All it does is control a WARNING, because throwing an error (at
least by default) would be too harsh, given that users have lived with
these risks for so long. But if all it does is throw a warning, what's
the point in modeling it in the catalog as though it's the most
important version?

Ultimately, I think collation version (as reported by
ucol_getVersion()) is the most accurate and least-surprising way to
match a library-provided collation with the collation in the catalog.
And it seems like we'd be using it in exactly the way the ICU
maintainers intend it to be used.
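
The search-by-collversion idea can be sketched roughly as follows, assuming a simplified in-memory list of loaded libraries in search order. `icu_lib_entry` and `find_library_by_collversion` are hypothetical names, not the structures used in the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a loaded ICU library entry. */
typedef struct icu_lib_entry
{
	int			major_version;
	const char *collator_version;	/* version reported for the locale */
} icu_lib_entry;

/*
 * Return the first library, in search order, whose collator version equals
 * the one recorded in the catalog, or NULL if none matches.
 */
static const icu_lib_entry *
find_library_by_collversion(const icu_lib_entry *libs, size_t nlibs,
							const char *wanted)
{
	for (size_t i = 0; i < nlibs; i++)
	{
		if (strcmp(libs[i].collator_version, wanted) == 0)
			return &libs[i];
	}
	return NULL;
}
```

With ICU 65 (153.97) and ICU 63 (153.88) available, a catalog entry recorded as 153.97 resolves to 65 under this sketch even though 64 is gone, which is the case described in the second point above.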

Of course, I cast my vote for #1 before I discovered this ICU bug
here:
/messages/by-id/0f7922d4f411376f420ec9139febeae4cdc748a6.camel@j-davis.com

That injects some doubt, to be sure. If I were to try to solve the
problems with #2-4, one approach might be to treat the built-in ICU
version differently from the ones in icu_library_path. Not quite sure,
I'd have to think more. But as of now, I'd still lean toward #1 until a
better option is presented.

Regards,
Jeff Davis

#106

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#98)

1 attachment(s)

Re: Collation version tracking for macOS

On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote:

On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote:

I'd vote for 1 on the grounds that it's easier to document and
understand a single collation version, which comes straight from
ucol_getVersion(). This approach makes it a separate problem to find
the collation version among whatever libraries the admin can provide;
but adding some observability into the search should mitigate any
confusion.

OK, it sounds like I should code that up next.

Here's the first iteration.

Thank you.

Proposed changes:

* I attached a first pass of some documentation.

* There should be another GUC to turn WARNING into an ERROR. Useful at least
for testing; perhaps too dangerous for production.

* The libraries should be loaded in a more deliberate order. The "*"
should be expanded in a descending fashion so that later versions are
preferred.

* GUCs should be validated.

* Should validate that loaded library has expected version.

* We need to revise or remove pg_collation_actual_version() and
pg_database_collation_actual_version().

* The GUCs are PGC_SUSET, but don't take effect because
icu_library_list_fully_loaded is never reset.

* The extra collations you're adding at bootstrap time are named based
on the library major version. I suppose it might be more "proper" to
name them based on the collation version, but that would be more
verbose, so I won't advocate for that. Just pointing it out.

* It looks hard (or impossible) to mix multiple ICU libraries with the
same major version and different minor versions. That's because,
e.g., libicui18n.so.63.1 links against libicuuc.so.63 and libicudata.so.63,
and when you install ICU 63.2, those dependencies get clobbered with
the 63.2 versions. That fails the sanity check I proposed above about
the library version number matching the requested library version
number. And it also just seems wrong -- why would you have minor-
version precision about an ICU library but then only major-version
precision about the ICU dependencies of that library? Doesn't that
defeat the whole purpose of this naming scheme? (Maybe another ICU
bug?).
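
As a rough illustration of the ordering point in the list above (expanding
"*" so that later versions are preferred), here is a minimal Python sketch
that sorts discovered library version strings in descending numeric order.
The function name and the input list are hypothetical and are not taken from
the patch; the only assumption is that versions look like "major" or
"major.minor".

```python
def order_versions_descending(versions):
    """Return version strings sorted so that the newest comes first.

    Components are compared numerically rather than as text, so "63.10"
    sorts above "63.2".
    """
    def key(version):
        return tuple(int(part) for part in version.split("."))

    return sorted(versions, key=key, reverse=True)


# Hypothetical set of library versions found in icu_library_path.
found = ["57.1", "67.1", "63.1", "63.10", "63.2"]
print(order_versions_descending(found))
```

Comparing numeric tuples rather than raw strings is the design choice here; a
plain string sort would place "63.10" before "63.2".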

Minor comments:

* ICU_I18N is defined in make_icu_library_name() but used outside of
it. One solution might be to have it return both library names to the
caller and rename it as make_icu_library_names().

* get_icu_function() could use a clarifying comment or a better name.
Something that communicates that you are looking for the function in
the given library with the given major version number (which may or may
not be needed depending on how the library was compiled).

* typo in comment over make_icu_collator:
s/u_getVersion/ucol_getVersion/

* The return value of make_icu_collator() seems backwards to me,
stylistically. I typically see the false-is-good pattern with integer
returns.

* weird bracketing style in get_icu_collator for the "else"

The version rosetta stone functions look like this:

postgres=# select * from pg_icu_library_versions();
icu_version | unicode_version | cldr_version
-------------+-----------------+--------------
67.1        | 13.0            | 37.0
63.1        | 11.0            | 34.0
57.1        | 8.0             | 29.0
(3 rows)

postgres=# select * from pg_icu_collation_versions('zh');
icu_version | uca_version | collator_version
-------------+-------------+------------------
67.1        | 13.0        | 153.14.37
63.1        | 11.0        | 153.88.34
57.1        | 8.0         | 153.64.29
(3 rows)

I like these functions.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

doc.patch (text/x-patch; charset=UTF-8)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 445fd175d8..b9dba8ac67 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1047,6 +1047,50 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
     </tip>
    </sect3>
   </sect2>
+  <sect2 id="collation-versions">
+   <title>Collation Versions</title>
+
+   <para>
+    Collations are sensitive to the specific collation version, which is
+    obtained from the collation provider library at the time the collation is
+    created (and only updated with <xref linkend="sql-altercollation"/>). If
+    the collation provider library is updated on the system (e.g. due to an
+    operating system upgrade), it may provide a different collation version;
+    but the version recorded in <productname>PostgreSQL</productname> will
+    remain unchanged.
+   </para>
+   <para>
+    New collation versions are generally desirable, as they reflect changes in
+    natural language over time. But these ordering changes can also cause
+    problems, such as inconsistency of indexes, which often depend on a
+    stable ordering. If <productname>PostgreSQL</productname> is unable to
+    find a collation in the collation provider that matches the recorded
+    version exactly, it will emit a <literal>WARNING</literal> (configurable
+    with <xref linkend="guc-collation-version-mismatch-error"/>).
+   </para>
+   <sect3 id="multiple-icu-libraries">
+    <title>Multiple ICU collation provider libraries</title>
+    <para>
+     When using the <literal>icu</literal> collation provider,
+     <productname>PostgreSQL</productname> can load multiple ICU collation
+     provider libraries, making it possible to find an exact match for the
+     collation version even if the operating system's ICU library has been
+     upgraded and provides a new collation version.
+    </para>
+    <para>
+     To use additional ICU collation provider libraries, set <xref
+     linkend="guc-icu-library-path"/> to the directory where the ICU libraries
+     are installed, and use <xref linkend="guc-icu-library-versions"/> to
+     control how those libraries are searched.
+    </para>
+    <note>
+     <para>
+      The <literal>libc</literal> collation provider does not allow specifying
+      multiple libraries.
+     </para>
+    </note>
+   </sect3>
+  </sect2>
  </sect1>
 
  <sect1 id="multibyte">
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9fd2075b1e..3809c26b31 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10288,6 +10288,107 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
      </variablelist>
     </sect2>
 
+    <sect2 id="runtime-config-compatible-collation">
+     <title>Collation Version Compatibility</title>
+     <variablelist>
+     <varlistentry id="guc-collation-version-mismatch-error">
+      <term><varname>collation_version_mismatch_error</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>collation_version_mismatch_error</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        If <productname>PostgreSQL</productname> detects mismatched collation
+        versions, and this variable is set to <literal>true</literal>, an
+        error will be raised. If this variable is set to
+        <literal>false</literal>, a warning will be raised instead. The
+        default is <literal>false</literal>.
+       </para>
+       <para>
+        See <xref linkend="collation-versions"/> for more information.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-icu-library-path">
+      <term><varname>icu_library_path</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>icu_library_path</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Set to the directory where additional ICU libraries are installed, to
+        be searched for matching collation versions.
+       </para>
+       <para>
+        See <xref linkend="multiple-icu-libraries"/> for more information.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-icu-library-versions">
+      <term><varname>icu_library_versions</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>icu_library_versions</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When searching for a matching collation version, search the ICU
+        libraries with the version numbers specified in this setting,
+        separated by commas. By default, only the built-in ICU library is
+        searched.
+       </para>
+       <para>
+        Library version numbers can be specified as either
+        <literal>major_version</literal> or
+        <literal>major_version.minor_version</literal>. By default, the
+        built-in ICU library is used.
+       </para>
+       <para>
+        If this variable is set to <literal>*</literal>,
+        <productname>PostgreSQL</productname> will attempt to load any ICU
+        library in <literal>icu_library_path</literal> with a compatible major
+        version.
+       </para>
+       <para>
+        See <xref linkend="multiple-icu-libraries"/> for more information.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-default-icu-library-version">
+      <term><varname>default_icu_library_version</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>default_icu_library_version</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        If <productname>PostgreSQL</productname> detects mismatched collation
+        versions, use the collation provided by the ICU library with this
+        version number.
+       </para>
+       <para>
+        Library version numbers can be specified as either
+        <literal>major_version</literal> or
+        <literal>major_version.minor_version</literal>. By default, the
+        built-in ICU library is used.
+       </para>
+       <para>
+        Has no effect if <literal>collation_version_mismatch_error</literal>
+        is set to <literal>true</literal>.
+       </para>
+       <para>
+        See <xref linkend="multiple-icu-libraries"/> for more information.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     </variablelist>
+    </sect2>
     <sect2 id="runtime-config-compatible-clients">
      <title>Platform and Client Compatibility</title>
      <variablelist>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 68cd4297d2..a9f6258e77 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27180,6 +27180,39 @@ postgres=# SELECT * FROM pg_walfile_name_offset((pg_backup_stop()).lsn);
         Use of this function is restricted to superusers.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_icu_library_versions</primary>
+        </indexterm>
+        <function>pg_icu_library_versions</function> ()
+        <returnvalue>setof record</returnvalue>
+        ( <parameter>icu_version</parameter> <type>text</type>,
+        <parameter>unicode_version</parameter> <type>text</type>,
+        <parameter>cldr_version</parameter> <type>text</type> )
+       </para>
+       <para>
+        Returns information from each available ICU library.
+       </para></entry>
+      </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_icu_collation_versions</primary>
+        </indexterm>
+        <function>pg_icu_collation_versions</function> ( <parameter>locale</parameter> <type>text</type> )
+        <returnvalue>setof record</returnvalue>
+        (<parameter>icu_version</parameter> <type>text</type>,
+        <parameter>uca_version</parameter> <type>text</type>,
+        <parameter>collator_version</parameter> <type>text</type> )
+       </para>
+       <para>
+        Returns the collation version of the given locale from each available
+        ICU library.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>

#107

Joe Conway

mail@joeconway.com

about 3 years ago

In reply to: Robert Haas (#100)

Re: Collation version tracking for macOS

On 11/28/22 14:11, Robert Haas wrote:

On Wed, Nov 23, 2022 at 12:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:

OK. Time for a new list of the various models we've discussed so far:

1. search-by-collversion: We introduce no new "library version"
concept to COLLATION and DATABASE objects and little or no new syntax.

2. lib-version-in-providers: We introduce a separate provider value
for each ICU version, for example ICU63, plus an unversioned ICU like
today.

3. lib-version-in-attributes: We introduce daticuversion (alongside
datcollversion) and collicuversion (alongside collversion). Similar
to the above, but it's a separate property and the provider is always
ICU. New syntax for CREATE/ALTER COLLATION/DATABASE to set and change
ICU_VERSION.

4. lib-version-in-locale: "63:en" from earlier versions. That was
mostly a strawman proposal to avoid getting bogged down in
syntax/catalogue/model change discussions while trying to prove that
dlopen would even work. It doesn't sound like anyone really likes
this.

5. lib-version-in-collversion: We didn't explicitly discuss this
before, but you hinted at it: we could just use u_getVersion() in
[dat]collversion.

I'd like to vote against #3 at least in the form that's described
here. If we had three more libraries providing collations, it's likely
that they would need versioning, too. So if we add an explicit notion
of provider version, then it ought not to be specific to libicu.

+many

I think it's OK to decide that different library versions are
different providers (your option #2), or that they are the same
provider but give rise to different collations (your option #4), or
that there can be multiple versions of each collation which are
distinguished by some additional provider version field (your #3 made
more generic).

I think provider and collation version are distinct concepts. The
provider ('c' versus 'i' for example) determines a unique code path in
the backend due to different APIs, whereas collation version is related
to a specific ordering given a set of characters.

I don't really understand #1 or #5 well enough to have an educated
opinion, but I do think that #1 seems a bit magical. It hopes that the
combination of a collation name and a datcollversion will be
sufficient to find exactly one matching collation in a list of provided
libraries. The advantage of that, as I understand it, is that if you
do something to your system that causes the number of matches to go
from one to zero, you can just throw another library on the pile and
get the number back up to one. Woohoo! But there's a part of me that
worries: what if the number goes up to two, and they're not all the
same? Probably that's something that shouldn't happen, but if it does
then I think there's kind of no way to fix it. With the other options,
if there's some way to jigger the catalog state to match what you want
to happen, you can always repair the situation somehow, because the
library to be used for each collation is explicitly specified in some
way, and you just have to get it to match what you want to have
happen.

My vote is for something like #5. The collversion should indicate a
specific immutable ordering behavior.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#108

Robert Haas

robertmhaas@gmail.com

about 3 years ago

In reply to: Jeff Davis (#105)

Re: Collation version tracking for macOS

On Mon, Nov 28, 2022 at 11:49 PM Jeff Davis <pgsql@j-davis.com> wrote:

Not necessarily, #2-4 (at least as implemented in v7) can only load one
major version at a time, so can't specify minor versions:
/messages/by-id/9f8e9b5a3352478d4cf7d6c0a5dd7e82496be4b6.camel@j-davis.com

With #1, you can provide control over the search order to find the
symbol you want. Granted, if you want to specify that different
collations look in different libraries for the same version, then it
won't work, because the search order is global -- is that what you're
worried about? If so, I think we need to compare it against the
downsides of #2-4, which in my opinion are more serious.

You know more about this than I do, for sure, so don't let my vote
back the project into a bad spot. But, yeah, the thing you mention
here is what I'm worried about. Without a way to force a certain
behavior for a certain particular collation, you don't have an escape
valve if the global library ordering isn't doing what you want. Your
argument seems to at least partly be that #1 will be more usable on
the whole, and that does seem like an important consideration. People
may have a lot of collations and adjusting them all individually could
be difficult and unpleasant. However, I think it's also worth asking
what options someone has if #1 can't be made to work due to a single
ordering controlling every collation.

It's entirely possible that the scenario I'm worried about is too
remote in practice to be concerned about. I don't know how this stuff
works well enough to be certain. It's just that, on the basis of
previous experience, (1) it's not that uncommon for people to actually
end up in situations that we thought shouldn't ever happen and (2)
code that deals with collations is more untrustworthy than average.

--
Robert Haas
EDB: http://www.enterprisedb.com

#109

Jeremy Schneider

schneider@ardentperf.com

about 3 years ago

In reply to: Jeff Davis (#101)

Re: Collation version tracking for macOS

On 11/28/22 6:54 PM, Jeff Davis wrote:

=# select * from pg_icu_collation_versions('en_US') order by icu_version;
icu_version | uca_version | collator_version
-------------+-------------+------------------
...
67.1        | 13.0        | 153.14
68.2        | 13.0        | 153.14
69.1        | 13.0        | 153.14
70.1        | 14.0        | 153.112
(21 rows)

This is good information, because it tells us that major library
versions change more often than collation versions, empirically
speaking.
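
To make the consequence for the search-by-collversion idea (#1) concrete,
here is a small Python sketch that looks up which ICU libraries report a
given collator version. The rows are copied from the en_US output above; the
function name is hypothetical, and this is only an illustration of the
lookup, not code from any of the patches.

```python
# Rows copied from the en_US output above: (icu_version, collator_version).
EN_US_ROWS = [
    ("67.1", "153.14"),
    ("68.2", "153.14"),
    ("69.1", "153.14"),
    ("70.1", "153.112"),
]


def libraries_matching(rows, recorded_collversion):
    """Return the ICU versions whose collator version equals the recorded one."""
    return [icu for icu, collver in rows if collver == recorded_collversion]


print(libraries_matching(EN_US_ROWS, "153.14"))   # several libraries match
print(libraries_matching(EN_US_ROWS, "153.112"))  # exactly one match
print(libraries_matching(EN_US_ROWS, "153.99"))   # no match (made-up version)
```

The three calls correspond to the many, one, and zero match cases that a
search keyed only on the collator version has to handle.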

It seems to me that the collator_version field is not a good version
identifier to use.

Just taking a quick glance at the ICU home page right now, it shows that
all of the last 5 versions of ICU have included "additions and
corrections" to locale data itself, including 68 to 69 where the
collator version did not change.

Is it possible that this "collator_version" only reflects the code that
processes collation data to do comparisons/sorts, but it does not
reflect updates to the locale data itself?

https://icu.unicode.org/

ICU v72 -> CLDR v42
ICU v71 -> CLDR v41
ICU v70 -> CLDR v40
ICU v69 -> CLDR v39
ICU v68 -> CLDR v38

-Jeremy

--
http://about.me/jeremy_schneider

#110

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeremy Schneider (#109)

Re: Collation version tracking for macOS

On Wed, Nov 30, 2022 at 7:03 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

It seems to me that the collator_version field is not a good version
identifier to use.

Just taking a quick glance at the ICU home page right now, it shows that
all of the last 5 versions of ICU have included "additions and
corrections" to locale data itself, including 68 to 69 where the
collator version did not change.

Is it possible that this "collator_version" only reflects the code that
processes collation data to do comparisons/sorts, but it does not
reflect updates to the locale data itself?

I think it also includes the CLDR version for *some* locales. From a
quick look, that includes 'ar', 'ru', 'tr', 'zh'. Jeff, would you
mind sharing the same table for one of those? Perhaps 'en' really
does depend only on the UCA?

#111

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Robert Haas (#108)

Re: Collation version tracking for macOS

On Tue, 2022-11-29 at 12:32 -0500, Robert Haas wrote:

You know more about this than I do, for sure, so don't let my vote
back the project into a bad spot.

I'm going back and forth myself. I haven't found a great answer here
yet.

But, yeah, the thing you mention
here is what I'm worried about. Without a way to force a certain
behavior for a certain particular collation, you don't have an escape
valve if the global library ordering isn't doing what you want.

One bit of weirdness is that I may have found another ICU problem.
First, install 63.1, and you get (editing for clarity):

$ ls -l /path/to/libicui18n.so.63*
/path/to/libicui18n.so.63 -> libicui18n.so.63.1
/path/to/libicui18n.so.63.1

$ ls -l /path/to/libicuuc.so.63*
/path/to/libicuuc.so.63 -> libicuuc.so.63.1
/path/to/libicuuc.so.63.1

$ ls -l /path/to/libicudata.so.63*
/path/to/libicudata.so.63 -> libicudata.so.63.1
/path/to/libicudata.so.63.1

$ ldd /path/to/libicui18n.so.63.1
libicuuc.so.63 => /path/to/libicuuc.so.63
libicudata.so.63 => /path/to/libicudata.so.63

OK, now install 63.2. Then you get:

$ ls -l /path/to/libicui18n.so.63*
/path/to/libicui18n.so.63 -> libicui18n.so.63.2
/path/to/libicui18n.so.63.1
/path/to/libicui18n.so.63.2

$ ls -l /path/to/libicuuc.so.63*
/path/to/libicuuc.so.63 -> libicuuc.so.63.2
/path/to/libicuuc.so.63.1
/path/to/libicuuc.so.63.2

$ ls -l /path/to/libicudata.so.63*
/path/to/libicudata.so.63 -> libicudata.so.63.2
/path/to/libicudata.so.63.1
/path/to/libicudata.so.63.2

$ ldd /path/to/libicui18n.so.63.2
libicuuc.so.63 => /path/to/libicuuc.so.63
libicudata.so.63 => /path/to/libicudata.so.63

The problem is that the specific minor version 63.1 depends on only the
major version of its ICU link dependencies. When loading
libicui18n.so.63.1, you are actually pulling in libicuuc.so.63.2 and
libicudata.so.63.2.

When I tried this with Thomas's patch, it caused some confusing
problems. I inserted a check that, when a library is opened, the
requested and reported versions match, and the check failed when
multiple minors are installed. In other words, opening
libicui18n.so.63.1 reports a version of 63.2!

(Note: I compiled ICU with --enable-rpath, but I don't think it
matters.)

Summary: even locking down to a minor version does not seem to identify
a specific ICU library, because its shared library dependencies do not
reference a specific minor version.
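
The kind of sanity check described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the function
names are hypothetical, the library name follows the Linux-style
"libicui18n.so.63.1" pattern shown in the listings, and the "reported"
version is passed in as a plain string (in real code it would have to come
from the loaded library itself).

```python
import re


def requested_version(library_path):
    """Extract the version suffix from a name like 'libicui18n.so.63.1'.

    Returns None if the name carries no version suffix.
    """
    match = re.search(r"\.so\.(\d+(?:\.\d+)?)$", library_path)
    return match.group(1) if match else None


def version_matches(requested, reported):
    """True if every requested component equals the reported one.

    A major-only request ("63") accepts any minor version of that major.
    """
    req = requested.split(".")
    rep = reported.split(".")
    return rep[:len(req)] == req


path = "/path/to/libicui18n.so.63.1"
# With 63.1 and 63.2 both installed, the library opened as 63.1 reports 63.2.
print(version_matches(requested_version(path), "63.2"))
```

With the situation described above, the check fails for the minor-specific
request but would pass for a request that names only major version 63.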

It's entirely possible that the scenario I'm worried about is too
remote in practice to be concerned about. I don't know how this stuff
works well enough to be certain. It's just that, on the basis of
previous experience, (1) it's not that uncommon for people to actually
end up in situations that we thought shouldn't ever happen and (2)
code that deals with collations is more untrustworthy than average.

Yeah...

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#112

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Joe Conway (#107)

Re: Collation version tracking for macOS

On Tue, 2022-11-29 at 11:27 -0500, Joe Conway wrote:

My vote is for something like #5. The collversion should indicate a
specific immutable ordering behavior.

Easier said than done:

/messages/by-id/abddc35a7a447d93e2b8371a1a9052cb48866070.camel@j-davis.com

Even pointing at a specific minor version doesn't guarantee that
specific ICU code is loaded. It could also be a mix of different minor
versions that happen to be installed.

But if we ignore that problem for a moment, and assume that major
version is precise enough, let me make another proposal (not advocating
for this, but wanted to put it out there):

6. Create a new concept of a "locked down collation" that points at
some specific collation code (identified by some combination of library
version and collation version or whatever else can be used to identify
it). If a collation is locked down, it would never have a fallback or
any other magic, it would either find the code it's looking for, or
fail. If a collation is not locked down, it would look only in the
built-in ICU library, and warn if it detects some kind of change
(again, by whatever heuristic we think is reasonable).

#6 doesn't answer all of the problems I pointed out earlier:

/messages/by-id/83faecb4a89dfb5794938e7b4d9f89daf4c5d631.camel@j-davis.com

but could be a better starting place for answers.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#113

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#110)

Re: Collation version tracking for macOS

On Wed, 2022-11-30 at 07:18 +1300, Thomas Munro wrote:

On Wed, Nov 30, 2022 at 7:03 AM Jeremy Schneider
<schneider@ardentperf.com> wrote:

It seems to me that the collator_version field is not a good version
identifier to use.

Just taking a quick glance at the ICU home page right now, it shows that
all of the last 5 versions of ICU have included "additions and
corrections" to locale data itself, including 68 to 69 where the
collator version did not change.

Is it possible that this "collator_version" only reflects the code that
processes collation data to do comparisons/sorts, but it does not
reflect updates to the locale data itself?

I think it also includes the CLDR version for *some* locales. From a
quick look, that includes 'ar', 'ru', 'tr', 'zh'. Jeff, would you
mind sharing the same table for one of those? Perhaps 'en' really
does depend only on the UCA?

=# select * from pg_icu_collation_versions('ar') order by icu_version;
icu_version | uca_version | collator_version
-------------+-------------+------------------
50.2 | 6.2 | 58.0.0.50
51.3 | 6.2 | 58.0.0.50
52.2 | 6.2 | 58.0.0.50
53.2 | 6.3 | 137.51.25
54.2 | 7.0 | 137.56.26
55.2 | 7.0 | 153.56.27.1
56.2 | 8.0 | 153.64.28
57.2 | 8.0 | 153.64.29
58.3 | 9.0 | 153.72.30.3
59.2 | 9.0 | 153.72.31.1
60.3 | 10.0 | 153.80.32.1
61.2 | 10.0 | 153.80.33
62.2 | 11.0 | 153.88.33.8
63.2 | 11.0 | 153.88.34
64.2 | 12.1 | 153.97.35.8
65.1 | 12.1 | 153.97.36
66.1 | 13.0 | 153.14.36.8
67.1 | 13.0 | 153.14.37
68.2 | 13.0 | 153.14.38.8
69.1 | 13.0 | 153.14.39
70.1 | 14.0 | 153.112.40
(21 rows)

=# select * from pg_icu_collation_versions('zh') order by icu_version;
icu_version | uca_version | collator_version
-------------+-------------+------------------
50.2 | 6.2 | 58.0.0.50
51.3 | 6.2 | 58.0.0.50
52.2 | 6.2 | 58.0.0.50
53.2 | 6.3 | 137.51.25
54.2 | 7.0 | 137.56.26
55.2 | 7.0 | 153.56.27.1
56.2 | 8.0 | 153.64.28
57.2 | 8.0 | 153.64.29
58.3 | 9.0 | 153.72.30.3
59.2 | 9.0 | 153.72.31.1
60.3 | 10.0 | 153.80.32.1
61.2 | 10.0 | 153.80.33
62.2 | 11.0 | 153.88.33.8
63.2 | 11.0 | 153.88.34
64.2 | 12.1 | 153.97.35.8
65.1 | 12.1 | 153.97.36
66.1 | 13.0 | 153.14.36.8
67.1 | 13.0 | 153.14.37
68.2 | 13.0 | 153.14.38.8
69.1 | 13.0 | 153.14.39
70.1 | 14.0 | 153.112.40
(21 rows)

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#114

Joe Conway

mail@joeconway.com

about 3 years ago

In reply to: Jeff Davis (#112)

Re: Collation version tracking for macOS

On 11/29/22 13:59, Jeff Davis wrote:

On Tue, 2022-11-29 at 11:27 -0500, Joe Conway wrote:

My vote is for something like #5. The collversion should indicate a
specific immutable ordering behavior.

Easier said than done:
/messages/by-id/abddc35a7a447d93e2b8371a1a9052cb48866070.camel@j-davis.com

Even pointing at a specific minor version doesn't guarantee that
specific ICU code is loaded. It could also be a mix of different minor
versions that happen to be installed.

I understand that it is not easily done, but if the combination of
collprovider + collversion does not represent specific immutable
ordering behavior for a given locale, what value is there in tracking it?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#115

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Jeff Davis (#103)

Re: Collation version tracking for macOS

On Mon, 2022-11-28 at 19:36 -0800, Jeff Davis wrote:

On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote:

That is ... astonishingly bad.

https://unicode-org.atlassian.net/browse/CLDR-16175

Oops, reported in CLDR instead of ICU. Moved to:

https://unicode-org.atlassian.net/browse/ICU-22215

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#116

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Jeff Davis (#111)

Re: Collation version tracking for macOS

On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:

One bit of weirdness is that I may have found another ICU problem.

Reported as:

https://unicode-org.atlassian.net/browse/ICU-22216

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#117

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#113)

Re: Collation version tracking for macOS

On Wed, Nov 30, 2022 at 8:03 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2022-11-30 at 07:18 +1300, Thomas Munro wrote:

I think it also includes the CLDR version for *some* locales. From a
quick look, that includes 'ar', 'ru', 'tr', 'zh'. Jeff, would you
mind sharing the same table for one of those? Perhaps 'en' really
does depend only on the UCA?

=# select * from pg_icu_collation_versions('ar') order by icu_version;
icu_version | uca_version | collator_version
-------------+-------------+------------------
50.2 | 6.2 | 58.0.0.50
51.3 | 6.2 | 58.0.0.50
52.2 | 6.2 | 58.0.0.50
53.2 | 6.3 | 137.51.25
54.2 | 7.0 | 137.56.26
55.2 | 7.0 | 153.56.27.1
56.2 | 8.0 | 153.64.28
57.2 | 8.0 | 153.64.29
58.3 | 9.0 | 153.72.30.3
59.2 | 9.0 | 153.72.31.1
60.3 | 10.0 | 153.80.32.1
61.2 | 10.0 | 153.80.33
62.2 | 11.0 | 153.88.33.8
63.2 | 11.0 | 153.88.34
64.2 | 12.1 | 153.97.35.8
65.1 | 12.1 | 153.97.36
66.1 | 13.0 | 153.14.36.8
67.1 | 13.0 | 153.14.37
68.2 | 13.0 | 153.14.38.8
69.1 | 13.0 | 153.14.39
70.1 | 14.0 | 153.112.40
(21 rows)

Thanks. So now we can see that the CLDR minor version is there too.
At a guess, in ICU 60 and before, it was the 4th component directly,
and from ICU 61 on, it's shifted left 3 bits. I guess that means
those CLDR-dependent locales have higher frequency collversion
changes, including everyday "apt-get upgrade" (no major OS upgrade
required), assuming that Debian et al take those minor upgrades, while
others like 'en' should be stable for the whole ICU major version's
lifetime, and even across some ICU major version upgrades, because the
Unicode/UCA version changes more slowly.
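
The guess above about how the CLDR version is packed into the collator
version can be written down as a short Python sketch. It encodes only the
guess stated in this message (fourth component used directly up to ICU 60,
shifted left by 3 bits from ICU 61 on) plus the observation from the tables
in this thread that the third component tracks the CLDR major version. The
function name is hypothetical, and the layout is an assumption, not a
documented ICU format.

```python
def cldr_version(icu_major, collator_version):
    """Decode (cldr_major, cldr_minor) from a collator version string.

    Assumes the three- or four-component layout seen for ICU 53 and later
    in this thread, e.g. "153.88.33.8".
    """
    if icu_major < 53:
        raise ValueError("layout before ICU 53 is different (e.g. 58.0.0.50)")
    parts = [int(p) for p in collator_version.split(".")]
    cldr_major = parts[2]
    fourth = parts[3] if len(parts) > 3 else 0
    # Guess from the message above: from ICU 61 on, the minor is shifted left 3 bits.
    cldr_minor = fourth >> 3 if icu_major >= 61 else fourth
    return cldr_major, cldr_minor


for icu_major, collver in [(58, "153.72.30.3"), (62, "153.88.33.8"), (63, "153.88.34")]:
    print(icu_major, collver, cldr_version(icu_major, collver))
```

If the guess is right, a minor ICU update that bumps the CLDR minor version
changes the decoded pair for CLDR-dependent locales even though the first
two components stay the same.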

Those CLDR-dependent locales therefore present us with a problem: as
discussed a while back, it's impossible to install two minor versions
of the same ICU major version with packages, and as Jeff has pointed
out in recent emails, even if you compile them yourself (which no one
really expects users to do), it doesn't really work because the
SONAMEs only have the major version, so the various libraries
that make up ICU will not be able to open each other correctly
(they'll follow symlinks to an arbitrary minor version). (These two
things are not unrelated.) So I probably need to remove the code that
claimed to support minor version addressing and go back to the
previous thinking that major will have to be enough.

In terms of user experience, I think that might mean that users of
'zh' who encounter warnings after a minor upgrade would therefore
really only have the options of REFRESHing and rebuilding, or
downgrading the package, because there's no way for us to access the
older version. Users of 'en' probably only encounter collversion
changes when moving between OS releases with an ICU major version
change, and then the various schemes in this thread can help them
avoid the need to rebuild, until they eventually want to, if ever.

#118

Robert Haas

robertmhaas@gmail.com

about 3 years ago

In reply to: Jeff Davis (#112)

Re: Collation version tracking for macOS

On Tue, Nov 29, 2022 at 1:59 PM Jeff Davis <pgsql@j-davis.com> wrote:

6. Create a new concept of a "locked down collation" that points at
some specific collation code (identified by some combination of library
version and collation version or whatever else can be used to identify
it). If a collation is locked down, it would never have a fallback or
any other magic, it would either find the code it's looking for, or
fail. If a collation is not locked down, it would look only in the
built-in ICU library, and warn if it detects some kind of change
(again, by whatever heuristic we think is reasonable).

It seems like it would be somewhat reasonable to allow varying levels
of specificity in saying what suffix to append when calling
dlopen() on the ICU library. Like you could allow adding nothing,
which would find the system-default ICU, or you could add 53 to find
the default version of ICU 53, or you could add 53.1 to pick a specific
minor version. The idea is that the symlinks in the filesystem would
be responsible for sorting out the meaning of the supplied string. The
way that minor versions work may preclude having this work as well as
one might hope, though.
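
A minimal Python sketch of the suffix idea above, showing the three levels of
specificity as file names. The function name is hypothetical and the base
name "libicui18n.so" is an assumption borrowed from the Linux-style listings
earlier in this thread; resolving each name to an actual file would be left
to the symlinks in the filesystem, as described.

```python
def icu_library_name(version=None, base="libicui18n.so"):
    """Build the name to hand to dlopen() for a given level of specificity.

    version=None  -> system-default ICU
    version="53"   -> default minor version of ICU 53
    version="53.1" -> one specific minor version
    """
    return base if not version else f"{base}.{version}"


for version in (None, "53", "53.1"):
    print(icu_library_name(version))
```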

I continue to be confused about why collation maintainers think that
it's OK to whack stuff around in minor versions. The thought that
people might use collations to sort data that needs to stay sorted
after upgrading the library seems to be an alien one, and it doesn't
really seem like libicu is a whole lot better than libc, either.

--
Robert Haas
EDB: http://www.enterprisedb.com

#119

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Robert Haas (#118)

Re: Collation version tracking for macOS

On Wed, Nov 30, 2022 at 8:52 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Nov 29, 2022 at 1:59 PM Jeff Davis <pgsql@j-davis.com> wrote:

6. Create a new concept of a "locked down collation" that points at
some specific collation code (identified by some combination of library
version and collation version or whatever else can be used to identify
it). If a collation is locked down, it would never have a fallback or
any other magic, it would either find the code it's looking for, or
fail. If a collation is not locked down, it would look only in the
built-in ICU library, and warn if it detects some kind of change
(again, by whatever heuristic we think is reasonable).

It seems like it would be somewhat reasonable to allow varying levels
of specificity in saying what suffix to append when calling
dlopen() on the ICU library. Like you could allow adding nothing,
which would find the system-default ICU, or you could add 53 to find
the default version of ICU 53, or you could add 53.1 to pick a specific
minor version. The idea is that the symlinks in the filesystem would
be responsible for sorting out the meaning of the supplied string. The
way that minor versions work may preclude having this work as well as
one might hope, though.

I'm struggling to understand what's new about proposal #6. The
earlier proposals except #1 already contemplated different levels of
locked-down-ness. For example in the libversion-as-provider idea, we
said you could use just provider = ICU (warn me if the collversion
changes, but always use the "default" library and carry on, pretty
much like today except perhaps "the default" can be changed with a
GUC), or you could be more specific and say provider = ICU63. (We
also mentioned ICU63_2 as a third level of specificity, but maybe
that's practically impossible.) And it was the same for the other
ideas, just encoded in different ways.
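
The three levels of specificity mentioned above (ICU, ICU63, ICU63_2) can be sketched as a small parser. This is only an illustration under assumptions: the struct IcuProviderSpec and the function parse_icu_provider() are hypothetical names, and the exact spelling of provider names was still under discussion in this thread.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: -1 means "not specified". */
typedef struct
{
    int major;
    int minor;
} IcuProviderSpec;

/*
 * Parse "ICU" (default library), "ICU63" (major only) or "ICU63_2"
 * (major and minor).  Returns 0 on success, -1 on malformed input;
 * on failure the contents of *spec are unspecified.
 */
int
parse_icu_provider(const char *name, IcuProviderSpec *spec)
{
    const char *p;
    char       *end;

    spec->major = -1;
    spec->minor = -1;

    if (strncmp(name, "ICU", 3) != 0)
        return -1;
    p = name + 3;
    if (*p == '\0')
        return 0;               /* plain "ICU" */

    if (!isdigit((unsigned char) *p))
        return -1;
    spec->major = (int) strtol(p, &end, 10);
    if (*end == '\0')
        return 0;               /* "ICU63" */

    if (*end != '_' || !isdigit((unsigned char) end[1]))
        return -1;
    spec->minor = (int) strtol(end + 1, &end, 10);
    return (*end == '\0') ? 0 : -1;   /* "ICU63_2" */
}
```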

#120

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Joe Conway (#114)

Re: Collation version tracking for macOS

On Tue, 2022-11-29 at 14:34 -0500, Joe Conway wrote:

I understand that it is not easily done, but if the combination of
collprovider + collversion does not represent specific immutable
ordering behavior for a given locale

Given the u_versionToString() bug, we know the version string could end
up being the same between two different collation versions (e.g.
153.104 and 153.14). So that really undermines the credibility of ICU's
collation versions (at least the strings, which is what we store in
collversion).

But if we ignore that bug, do we have evidence that the actual versions
could be the same for collations that sort differently? It's worth
exploring, to be sure, but right now I don't know of a case.

, what value is there in tracking [collation version]?

Similarly, what is the value in tracking the library minor versions, if
when you open libicui18n.63.1, you may end up with a mix of code
between 63.1 and 63.2?

That doesn't mean it's impossible. We could attach collations to a
library major version, and tell administrators that once they install a
major version in icu_library_path, they never touch that major version
again (no updates or new minors, only new majors). #6 might be a good
approach to facilitate this best practice. We'd then probably need to
change collversion to be a library major version, and then come up with
a migration path from 15 -> 16. Or we could store both library major
version and collversion, and verify both.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#121

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#117)

Re: Collation version tracking for macOS

On Wed, 2022-11-30 at 08:41 +1300, Thomas Munro wrote:

In terms of user experience, I think that might mean that users of
'zh' who encounter warnings after a minor upgrade would therefore
really only have the options of REFRESHing and rebuilding, or
downgrading the package, because there's no way for us to access the
older version. Users of 'en' probably only encounter collversion
changes when moving between OS releases with an ICU major version
change, and then the various schemes in this thread can help them
avoid the need to rebuild, until they eventually want to, if ever.

I installed the first minor release for each major, and got some new
tables. I think we can all agree that it's a lot easier to work with
information once it's in table form.

Here's what I found for the 'ar' locale (firstminor/lastminor are the
icu library versions, firstcollversion/lastcollversion are their
respective collation versions for the given locale):

 firstminor | lastminor | firstcollversion | lastcollversion
------------+-----------+------------------+-----------------
 60.1       | 60.3      | 153.80.32        | 153.80.32.1
 64.1       | 64.2      | 153.96.35        | 153.97.35.8
 68.1       | 68.2      | 153.14.38        | 153.14.38.8
(3 rows)

For 'en':

 firstminor | lastminor | firstcollversion | lastcollversion
------------+-----------+------------------+-----------------
 64.1       | 64.2      | 153.96           | 153.97
(1 row)

And for 'zh':

It looks like collation versions do change in minor releases. It looks
like it's *not* safe to lock a collation to a major version *if* that
major version could be updated to a new minor. And we can't lock to a
minor, as I said earlier. Therefore, once we lock a collation down to a
major release, we better keep that in the icu_library_path, and never
touch it, and never install a new minor for that major.

Then again, maybe some of these are just about how the version is
reported... maybe 153.80.32 and 153.80.32.1 are really the same
version? But 64.1 -> 64.2 looks like a real difference.

I suppose the next step is to test with actual data and find
differences?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#122

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#121)

Re: Collation version tracking for macOS

On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql@j-davis.com> wrote:

Here's what I found for the 'ar' locale (firstminor/lastminor are the
icu library versions, firstcollversion/lastcollversion are their
respective collation versions for the given locale):

 firstminor | lastminor | firstcollversion | lastcollversion
------------+-----------+------------------+-----------------
 60.1       | 60.3      | 153.80.32        | 153.80.32.1
 64.1       | 64.2      | 153.96.35        | 153.97.35.8
 68.1       | 68.2      | 153.14.38        | 153.14.38.8
(3 rows)

Right, this fits with what I said earlier: the third component is CLDR
major, fourth component is CLDR minor except from ICU 61 on the CLDR
minor is << 3'd (X.X.38.8 means CLDR 38.1). I wrote something about
that particular CLDR upgrade that happened in ICU 68 back here, with a
link to the CLDR change list:

/messages/by-id/CA+hUKGJxg6AbKC9RJ7r1ByVLtvVkThQV+RZO6BKVWYESPCp3Ug@mail.gmail.com

TL;DR that particular CLDR change didn't actually affect collations,
it affected other locale stuff we don't care about (timezones etc).
We probably have to assume that any CLDR change *might* affect us,
though, unless we can find a written policy somewhere that says CLDR
minor changes never change sort order. But I wouldn't want to get
into 2nd guessing their ucol_getVersion() format, and if they knew
that minor changes didn't affect sort order they presumably wouldn't
have included it in the recipe, so I think we simply have to treat it
as opaque and assume that ucol_getVersion() change means what it says
on the tin: sort order might have changed.

I suppose the next step is to test with actual data and find
differences?

Easier to read the published CLDR deltas, but I'm not sure it'd tell
us much about what *could* happen in future releases...
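
For diagnostic purposes only (the message above argues that the ucol_getVersion() result should be treated as opaque), the layout described above can be sketched as follows. The CLDR part follows the description in this message. The packing of the second component as (UCA major << 3) | UCA minor is an inference from the numbers quoted in this thread (96 for Unicode 12.0, 97 for 12.1), not a documented contract, and the names CollVersionParts and decode_collversion() are hypothetical.

```c
#include <stdint.h>

/* Hypothetical holder for the decoded pieces. */
typedef struct
{
    int uca_major;
    int uca_minor;
    int cldr_major;
    int cldr_minor;
} CollVersionParts;

/*
 * Decode the four bytes of a collator version, e.g. {153, 97, 35, 8}.
 *
 * v[1]: assumed to be (UCA major << 3) | UCA minor (inferred, see above).
 * v[2]: CLDR major.
 * v[3]: CLDR minor, shifted left by 3 from ICU 61 on, unshifted before.
 */
CollVersionParts
decode_collversion(const uint8_t v[4], int icu_major)
{
    CollVersionParts p;

    p.uca_major = v[1] >> 3;
    p.uca_minor = v[1] & 0x07;
    p.cldr_major = v[2];
    p.cldr_minor = (icu_major >= 61) ? (v[3] >> 3) : v[3];
    return p;
}
```

Under these assumptions, 153.97.35.8 from ICU 64.2 reads as UCA 12.1 with CLDR 35.1, and 153.80.32.1 from ICU 60.3 reads as UCA 10.0 with CLDR minor 1.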

#123

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#119)

Re: Collation version tracking for macOS

On Wed, 2022-11-30 at 09:00 +1300, Thomas Munro wrote:

I'm struggling to understand what's new about proposal #6.

Perhaps it's just a slight variant; I'm not sure. It's not a complete
proposal yet.

The difference I had in mind is that it would treat the built-in ICU
differently from what is found in icu_library_path. I think that could
remove confusion over what happens when you upgrade the system's ICU
library.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#124

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#116)

Re: Collation version tracking for macOS

On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:

One bit of weirdness is that I may have found another ICU problem.

Reported as:

https://unicode-org.atlassian.net/browse/ICU-22216

I'm no expert on loader/linker arcana but I have a feeling this is a
dead end. It's an ancient Unix or at least elf-era Unix convention
that SONAMEs have major versions only, because major versions are the
basis of ABI stability.

As a workaround with an already built ICU, I think you could use elf
editing tools like "patchelf" to change the SONAME and DT_NEEDED to
include the minor version. Or you could convince the build/link
scripts to set them that way in the first place, but no distro would
want to do that as it would cause lots of executables to fail to load
when the next ICU minor comes out.

#125

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#124)

Re: Collation version tracking for macOS

On Wed, 2022-11-30 at 10:52 +1300, Thomas Munro wrote:

On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:

One bit of weirdness is that I may have found another ICU problem.

Reported as:

https://unicode-org.atlassian.net/browse/ICU-22216

I'm no expert on loader/linker arcana but I have a feeling this is a
dead end. It's an ancient Unix or at least elf-era Unix convention
that SONAMEs have major versions only, because major versions are the
basis of ABI stability.

It's possible that it's more a problem of how they are doing it: the
specific version is coming from a dependency rather than the library
itself. The results are surprising, so I figured it's worth a report.
Let's see what they say.

Regardless, even if they did make a change, it's not going to help us
anytime soon. We can't rely on any scheme that involves multiple minor
versions for a single major version being installed at once. That means
that, if you create a collation depending on ICU X.Y, and then it gets
upgraded to X.(Y+1), and you create another collation depending on that
library version, you are stuck.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#126

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#122)

Re: Collation version tracking for macOS

On Wed, 2022-11-30 at 10:29 +1300, Thomas Munro wrote:

On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql@j-davis.com> wrote:

Here's what I found for the 'ar' locale (firstminor/lastminor are the
icu library versions, firstcollversion/lastcollversion are their
respective collation versions for the given locale):

firstminor | lastminor | firstcollversion | lastcollversion
------------+-----------+------------------+-----------------
60.1       | 60.3      | 153.80.32        | 153.80.32.1
64.1       | 64.2      | 153.96.35        | 153.97.35.8
68.1       | 68.2      | 153.14.38        | 153.14.38.8
(3 rows)

Right, this fits with what I said earlier: the third component is CLDR
major, fourth component is CLDR minor except from ICU 61 on the CLDR
minor is << 3'd (X.X.38.8 means CLDR 38.1).

What about 64.1 -> 64.2? That changed the *second* component from 96 ->
97. Are we agreed that collations can materially change in minor ICU
releases?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#127

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#126)

Re: Collation version tracking for macOS

On Wed, Nov 30, 2022 at 1:32 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2022-11-30 at 10:29 +1300, Thomas Munro wrote:

On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql@j-davis.com> wrote:

Here's what I found for the 'ar' locale (firstminor/lastminor are the
icu library versions, firstcollversion/lastcollversion are their
respective collation versions for the given locale):

 firstminor | lastminor | firstcollversion | lastcollversion
------------+-----------+------------------+-----------------
 60.1       | 60.3      | 153.80.32        | 153.80.32.1
 64.1       | 64.2      | 153.96.35        | 153.97.35.8
 68.1       | 68.2      | 153.14.38        | 153.14.38.8
(3 rows)

Right, this fits with what I said earlier: the third component is CLDR
major, fourth component is CLDR minor except from ICU 61 on the CLDR
minor is << 3'd (X.X.38.8 means CLDR 38.1).

What about 64.1 -> 64.2? That changed the *second* component from 96 ->
97. Are we agreed that collations can materially change in minor ICU
releases?

That means that the Unicode/UCA version switched from 12 to 12.1, so
that's a confirmed sighting of a UCA minor version bump within one ICU
major version. Let's see what the purpose of that Unicode minor
release was[1]:

"Unicode 12.1 adds exactly one character, for a total of 137,929 characters.

The new character added to Version 12.1 is:

U+32FF SQUARE ERA NAME REIWA

Version 12.1 adds that single character to enable software to be
rapidly updated to support the new Japanese era name in calendrical
systems and date formatting. The new Japanese era name was officially
announced on April 1, 2019, and is effective as of May 1, 2019."

Wow!

Wikipedia says[2] "the "rei" character 令 has never appeared before".

The sort order of characters that didn't previously exist is a special
topic. In theory they can't hurt you because you shouldn't have been
using them, but PostgreSQL doesn't enforce that (other systems do), so
you could be exposed to a change from whatever default ordering the
non-existent codepoint had for random implementation reasons to some
deliberate ordering which may or may not be the same.

Are all Unicode/UCA minor versions of that type? I dunno. Something
to research, but [3] is far too vague and [4] is about other problems.

[1]: https://unicode.org/versions/Unicode12.1.0/
[2]: https://en.wikipedia.org/wiki/Reiwa
[3]: https://www.unicode.org/versions/#major_minor
[4]: https://www.unicode.org/policies/stability_policy.html

#128

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#125)

Re: Collation version tracking for macOS

On Wed, Nov 30, 2022 at 1:25 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2022-11-30 at 10:52 +1300, Thomas Munro wrote:

On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:
https://unicode-org.atlassian.net/browse/ICU-22216

I'm no expert on loader/linker arcana but I have a feeling this is a
dead end. It's an ancient Unix or at least elf-era Unix convention
that SONAMEs have major versions only, because major versions are the
basis of ABI stability.

It's possible that it's more a problem of how they are doing it: the
specific version is coming from a dependency rather than the library
itself. The results are surprising, so I figured it's worth a report.
Let's see what they say.

Regardless, even if they did make a change, it's not going to help us
anytime soon. We can't rely on any scheme that involves multiple minor
versions for a single major version being installed at once. That means
that, if you create a collation depending on ICU X.Y, and then it gets
upgraded to X.(Y+1), and you create another collation depending on that
library version, you are stuck.

Mainstream package maintainers aren't going to let that happen anyway
as discussed, so this would always be a fairly specialised concern.
Maybe someone in our community would be motivated to publish a repo
full of mutant packages that don't conflict with each other and that
have specially modified DT_NEEDED, or are rolled into one single
library so the DT_NEEDED problem goes away.

#129

Michael Paquier

michael@paquier.xyz

about 3 years ago

In reply to: Thomas Munro (#127)

Re: Collation version tracking for macOS

On Wed, Nov 30, 2022 at 01:50:51PM +1300, Thomas Munro wrote:

The new character added to Version 12.1 is:

U+32FF SQUARE ERA NAME REIWA

Version 12.1 adds that single character to enable software to be
rapidly updated to support the new Japanese era name in calendrical
systems and date formatting. The new Japanese era name was officially
announced on April 1, 2019, and is effective as of May 1, 2019."

Wow!

Wow++. I didn't know this one.

Wikipedia says[2] "the "rei" character 令 has never appeared before".

At least there was some time ahead to prepare for the switch from
"平成" to "令和". Things were much "funnier" when the era switched
from "昭和" to "平成", as the sudden death of the emperor required
Japan to switch to a new calendar very suddenly back in the day.
I've heard this was quite a mess for folks in IT back then, especially
for public agencies.
--
Michael

#130

Dagfinn Ilmari Mannsåker

ilmari@ilmari.org

about 3 years ago

In reply to: Jeff Davis (#115)

Re: Collation version tracking for macOS

Jeff Davis <pgsql@j-davis.com> writes:

On Mon, 2022-11-28 at 19:36 -0800, Jeff Davis wrote:

On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote:

That is ... astonishingly bad.

https://unicode-org.atlassian.net/browse/CLDR-16175

Oops, reported in CLDR instead of ICU. Moved to:

https://unicode-org.atlassian.net/browse/ICU-22215

Out of morbid curiosity I went source diving, and the culprit is this
bit (which will also break if a version component ever goes above 999):

    /* write the decimal field value */
    field=versionArray[part];
    if(field>=100) {
        *versionString++=(char)('0'+field/100);
        field%=100;
    }
    if(field>=10) {
        *versionString++=(char)('0'+field/10);
        field%=10;
    }
    *versionString++=(char)('0'+field);

(https://sources.debian.org/src/icu/72.1-3/source/common/putil.cpp#L2308)

because apparently snprintf() is too hard?
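
To make the failure mode concrete, here is a small standalone sketch that mimics the quoted digit-writing logic for a single field and compares it with snprintf(). The function names format_field_buggy() and format_field_fixed() are made up for this illustration and are not ICU API. The point is that once the hundreds digit has been written, a zero tens digit is skipped, so 104 and 14 produce the same text.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Mimics the quoted logic for one version field.
 * out must have room for at least 4 bytes.
 */
void
format_field_buggy(uint8_t value, char *out)
{
    unsigned field = value;

    if (field >= 100)
    {
        *out++ = (char) ('0' + field / 100);
        field %= 100;
    }
    /* After the modulo, a value such as 104 is now 4: the '0' is lost. */
    if (field >= 10)
    {
        *out++ = (char) ('0' + field / 10);
        field %= 10;
    }
    *out++ = (char) ('0' + field);
    *out = '\0';
}

/* The straightforward alternative. */
void
format_field_fixed(uint8_t value, char *out, size_t outlen)
{
    snprintf(out, outlen, "%u", (unsigned) value);
}
```

With this sketch, format_field_buggy() renders both 104 and 14 as "14", which matches the 153.104 versus 153.14 ambiguity discussed earlier in the thread, while format_field_fixed() keeps them distinct.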

- ilmari

#131

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Jeff Davis (#106)

Re: Collation version tracking for macOS

On Tue, Nov 29, 2022 at 7:51 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote:

On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote:

I'd vote for 1 on the grounds that it's easier to document and
understand a single collation version, which comes straight from
ucol_getVersion(). This approach makes it a separate problem to find
the collation version among whatever libraries the admin can provide;
but adding some observability into the search should mitigate any
confusion.

OK, it sounds like I should code that up next.

Here's the first iteration.

Thank you.

Thanks for the review. Responses further down. And thanks also for
the really interesting discussion about how the version numbers work
(or in some cases, don't work...), and practical packaging and linking
problems.

To have a hope of making something happen for PG16, which I think
means we need a serious contender patch in the next few weeks, we
really need to make some decisions. I enjoyed trying out
search-by-collversion, but it's still not my favourite. On the ballot
we have two main questions:

1. Should we commit to search-by-collversion, or one of the explicit
library version ideas, and if the latter, which?
2. Should we try to support being specific about minor versions (in
various different ways according to the choice made for #1)?

My tentative votes are:

1. I think we should seriously consider provider = ICU63. I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work. Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while still
making it super clear what is going on. This would include adding DDL
commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63
to make warnings go away.

2. I think we should ignore minor versions for now (other than
reporting them in the relevant introspection functions), but not make
any choices that would prevent us from changing our mind about that in
a later release. For example, having two levels of specificity ICU
and ICU68 in the libver-in-provider-name design wouldn't preclude us
from adding support for ICU68_2 later.

I haven't actually tried that design out in code yet, but I'm willing
to try to code that up very soon. So no new patch from me yet. Does
anyone else want to express a view?

Proposed changes:

* I attached a first pass of some documentation.

Thanks. Looks pretty good, and much of it would stay if we changed to
one of the other models.

* Should be another GUC to turn WARNING into an ERROR. Useful at least
for testing; perhaps too dangerous for production.

OK, will add that into the next version.

* The libraries should be loaded in a more deliberate order. The "*"
should be expanded in a descending fashion so that later versions are
preferred.

Yeah, I agree.

* GUCs should be validated.

Will do.

* Should validate that loaded library has expected version.

Will do.

* We need to revise or remove pg_collation_actual_version() and
pg_database_collation_actual_version().

I never liked that use of the word "actual"...

* The GUCs are PGC_SUSET, but don't take effect because
icu_library_list_fully_loaded is never reset.

True. Just rough edges because I was trying to prototype
search-by-collversion fast. Will consider this for the next version.

* The extra collations you're adding at bootstrap time are named based
on the library major version. I suppose it might be more "proper" to
name them based on the collation version, but that would be more
verbose, so I won't advocate for that. Just pointing it out.

Ah, yes, the ones with names like "en-US-x-icu68". I agree that made
a little less sense in the search-by-collversion patch. Maybe we
wouldn't want these at all in the search-by-collversion model. But I
think they're perfect the way they are in the provider = ICU68 model.
The other idea I considered ages ago was that we could use namespaces:
you could "icu68.en-US", or just "en-US" in some contexts to get what
your search path sees, but that all seemed a little too cute and not
really like anything else we do with system-created catalogues, so I
gave that idea up.

* It looks hard (or impossible) to mix multiple ICU libraries with the
same major version and different minor versions. That's because,
e.g., libicui18n.so.63.1 links against libicuuc.63 and libicudata.63,
and when you install ICU 63.2, those dependencies get clobbered with
the 63.2 versions. That fails the sanity check I proposed above about
the library version number matching the requested library version
number. And it also just seems wrong -- why would you have minor-
version precision about an ICU library but then only major-version
precision about the ICU dependencies of that library? Doesn't that
defeat the whole purpose of this naming scheme? (Maybe another ICU
bug?).

I don't think it's a bug exactly. That scheme is designed to
advertise ABI stability, and not intended to support parallel
installation of minor versions. It does seem a little silly for
libraries that are shipped together as one atomic unit not to use
fully qualified dependency names, though.

I think there would be various technical solutions, if you're prepared
to give up existing ready-made packages and build stuff yourself.
Install them into different directories with different DT_RPATH so
they can't see each other (but then our icu_library_path needs to
support a list of paths or it won't find these ones which will have to
be not in the usual system path), or clobber the DT_NEEDED (but I
guess not the DT_SONAME) to mention the minor version, and equivalent
concepts for other non-elf systems (at a glance the same problem
applies on macOS), or re-roll the libraries into a single .so. Or
convince them to support a single library build mode (maybe there is
one already? I couldn't find it).

That's all a bit against the grain for now, and makes me want to
abandon the notion of minor versions completely for now but leave the
option open for later exploration.

In the meantime, I think the feature is still pretty useful. For
example, it helps you with the common case of a major OS upgrade or
streaming replication across major OS versions: just find the right
.deb/rpm/whatever for the older one, and install it, until you're
ready to upgrade and REFRESH. The story is not quite as good for
someone with an index full of Chinese or Turkish text who gets a
surprise warning after a minor apt-get update, because the Japanese
have decided to invent a new character. We can't offer a nice
solution to that: they have to determine that it is safe to REFRESH to
clear the warning, with or without rebuild, or downgrade/pin the ICU
package until they are ready to REFRESH. But that is already the case
today and this patch neither helps nor hinders. The only reason we
didn't know about this pre-existing type of problem is because
(approximately) nobody uses ICU yet, because it wasn't available as a
database default yet.

Minor comments:

* ICU_I18N is defined in make_icu_library_name() but used outside of
it. One solution might be to have it return both library names to the
caller and rename it as make_icu_library_names().

Good idea, will do.

* get_icu_function() could use a clarifying comment or a better name.
Something that communicates that you are looking for the function in
the given library with the given major version number (which may or may
not be needed depending on how the library was compiled).

Agreed.

* typo in comment over make_icu_collator:
s/u_getVersion/ucol_getVersion/

Thanks.

* The return value of make_icu_collator() seems backwards to me,
stylistically. I typically see the false-is-good pattern with integer
returns.

Agreed.

* weird bracketing style in get_icu_collator for the "else"

Yep.

The version rosetta stone functions look like this:

postgres=# select * from pg_icu_library_versions();
 icu_version | unicode_version | cldr_version
-------------+-----------------+--------------
 67.1        | 13.0            | 37.0
 63.1        | 11.0            | 34.0
 57.1        | 8.0             | 29.0
(3 rows)

postgres=# select * from pg_icu_collation_versions('zh');
 icu_version | uca_version | collator_version
-------------+-------------+------------------
 67.1        | 13.0        | 153.14.37
 63.1        | 11.0        | 153.88.34
 57.1        | 8.0         | 153.64.29
(3 rows)

I like these functions.

Yeah, they've been quite educational. Now I'm wondering what form
these functions would take in the provider = ICU68 patch.

#132

Robert Haas

robertmhaas@gmail.com

about 3 years ago

In reply to: Thomas Munro (#131)

Re: Collation version tracking for macOS

On Sun, Dec 4, 2022 at 10:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:

My tentative votes are:

1. I think we should seriously consider provider = ICU63. I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work. Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while still
making it super clear what is going on. This would include adding DDL
commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63
to make warnings go away.

+1. I wouldn't lose any sleep if we picked a different non-magical
option, but I think this is probably my favorite of the
explicit-library-version options (though it is close) and I like it
better than search-by-collversion.

(It's possible that I'm wrong to like it better, but I do.)

2. I think we should ignore minor versions for now (other than
reporting them in the relevant introspection functions), but not make
any choices that would prevent us from changing our mind about that in
a later release. For example, having two levels of specificity ICU
and ICU68 in the libver-in-provider-name design wouldn't preclude us
from adding support for ICU68_2 later.

+1.

--
Robert Haas
EDB: http://www.enterprisedb.com

#133

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#131)

Re: Collation version tracking for macOS

On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:

1. I think we should seriously consider provider = ICU63. I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work. Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while still
making it super clear what is going on.

As I understand it, this is #2 in your previous list?

Can we put the naming of the provider into the hands of the user, e.g.:

CREATE COLLATION PROVIDER icu63 TYPE icu
AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';

In this model, icu would be a "provider kind" and icu63 would be the
specific provider, which is named by the user.

That seems like the least magical approach, to me. We need an ICU
library; the administrator gives us one that looks like ICU; and we're
happy.

It avoids a lot of the annoyances we're discussing, and puts the power
in the hands of the admin. If they want to allow minor version updates,
they specify the library with .so.63, and let the symlinking handle it.

Of course, we can still do some sanity checks (WARNINGs or ERRORs) when
we think something is going wrong; like the version of ICU is too new,
or the reported version (ucol_getVersion()) doesn't match what's in
collversion. But we basically get out of the business of understanding
ICU versioning and leave that up to the administrator.
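
A rough sketch of the sanity check described above, assuming the stored collversion and the string reported by the loaded library are simply compared as text. The enum CollVersionCheck, the function check_collversion() and the mismatch_is_error flag (standing in for a possible WARNING-to-ERROR setting mentioned earlier in the thread) are all hypothetical names for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome of comparing recorded and reported versions. */
typedef enum
{
    COLLVERSION_OK,
    COLLVERSION_WARN,
    COLLVERSION_ERROR
} CollVersionCheck;

/*
 * stored:  the collversion recorded in the catalog, or NULL if none.
 * actual:  the version string reported by the loaded library, or NULL
 *          if it could not be determined.
 * mismatch_is_error: nonzero to escalate a mismatch from WARN to ERROR.
 */
CollVersionCheck
check_collversion(const char *stored, const char *actual,
                  int mismatch_is_error)
{
    if (stored == NULL)
        return COLLVERSION_OK;      /* nothing recorded, nothing to verify */

    if (actual != NULL && strcmp(stored, actual) == 0)
        return COLLVERSION_OK;

    return mismatch_is_error ? COLLVERSION_ERROR : COLLVERSION_WARN;
}
```

Note that a plain string comparison inherits the u_versionToString() ambiguity discussed earlier, so two different versions could still compare equal.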

It's easier to document, and would require fewer GUCs (if any). And it
avoids mixing version information from another project into our data
model.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#134

Joe Conway

mail@joeconway.com

about 3 years ago

In reply to: Jeff Davis (#133)

Re: Collation version tracking for macOS

On 12/5/22 12:41, Jeff Davis wrote:

On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:

1. I think we should seriously consider provider = ICU63. I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work. Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while still
making it super clear what is going on.

As I understand it, this is #2 in your previous list?

Can we put the naming of the provider into the hands of the user, e.g.:

CREATE COLLATION PROVIDER icu63 TYPE icu
AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';

In this model, icu would be a "provider kind" and icu63 would be the
specific provider, which is named by the user.

That seems like the least magical approach, to me. We need an ICU
library; the administrator gives us one that looks like ICU; and we're
happy.

I like this. The provider kind defines which path we take in our code,
and the specific library unambiguously defines a specific collation
behavior (I think, ignoring bugs?)

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#135

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Joe Conway (#134)

Re: Collation version tracking for macOS

On Tue, Dec 6, 2022 at 6:45 AM Joe Conway <mail@joeconway.com> wrote:

On 12/5/22 12:41, Jeff Davis wrote:

On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:

1. I think we should seriously consider provider = ICU63. I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work. Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while still
making it super clear what is going on.

As I understand it, this is #2 in your previous list?

Can we put the naming of the provider into the hands of the user, e.g.:

CREATE COLLATION PROVIDER icu63 TYPE icu
AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';

In this model, icu would be a "provider kind" and icu63 would be the
specific provider, which is named by the user.

That seems like the least magical approach, to me. We need an ICU
library; the administrator gives us one that looks like ICU; and we're
happy.

+1

I like this. The provider kind defines which path we take in our code,
and the specific library unambiguously defines a specific collation
behavior (I think, ignoring bugs?)

OK, I'm going to see what happens if I try to wrangle that stuff into
a new catalogue table.
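
As a rough illustration of the "provider kind" versus "named provider" split discussed above, here is a minimal sketch in plain C. Everything in it (demo_provider_kind, demo_provider, demo_lookup_provider, and the in-memory table) is hypothetical and only stands in for whatever a real catalogue table would hold; the library paths are the placeholders from the CREATE COLLATION PROVIDER example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: the provider kind selects which code path is taken. */
typedef enum demo_provider_kind
{
    DEMO_KIND_LIBC,
    DEMO_KIND_ICU
} demo_provider_kind;

/* Hypothetical catalogue row: a user-chosen name bound to specific libraries. */
typedef struct demo_provider
{
    const char *name;           /* chosen by the administrator, e.g. "icu63" */
    demo_provider_kind kind;    /* decides the code path */
    const char *i18n_path;      /* placeholder path, NULL for libc */
    const char *uc_path;        /* placeholder path, NULL for libc */
} demo_provider;

/* Stand-in for a catalogue table; a real implementation would read a catalog. */
static const demo_provider demo_providers[] = {
    {"libc", DEMO_KIND_LIBC, NULL, NULL},
    {"icu63", DEMO_KIND_ICU,
     "/path/to/libicui18n.so.63", "/path/to/libicuuc.so.63"},
};

/* Look up a provider by its user-assigned name; returns NULL if unknown. */
const demo_provider *
demo_lookup_provider(const char *name)
{
    size_t n = sizeof(demo_providers) / sizeof(demo_providers[0]);

    assert(name != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(demo_providers[i].name, name) == 0)
            return &demo_providers[i];
    }
    return NULL;
}
```

The point of the sketch is only that the name carries no meaning of its own: the kind picks the code path, and the paths pin the exact library.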

#136

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Thomas Munro (#135)

5 attachment(s)

Re: Collation version tracking for macOS

On Tue, 2022-12-06 at 10:33 +1300, Thomas Munro wrote:

OK, I'm going to see what happens if I try to wrangle that stuff into
a new catalogue table.

I've been hacking on a major refactor of the locale-related code. I
attached my progress and I think several patches are ready.

The main motivation is that I was frustrated by the special cases
everywhere. I wanted to make it easier to hack on this code going
forward, now that we are adding even more complexity for multiple ICU
libraries.

I'm posting to this thread (rather than my previous refactoring
thread[1]) because a lot of the people interested in working on this
code are here. So, if you like (or don't like) the structure of these
changes, please let me know.

Changes:
* Introduce pg_locale_internal.h to hide all USE_ICU code,
including all callers of the ICU routines. The files that
still need to include pg_locale_internal.h are:
- pg_locale.c
- regc_pg_locale.c
- formatting.c
- like.c
- like_support.c
- collationcmds.c
* Other callers (in files that don't include
pg_locale_internal.h) don't need to branch based on the
provider, platform, database encoding, USE_ICU,
HAVE_LOCALE_T, etc.
* ICU and libc are treated the same way in more places.
* I made it so pg_locale_t is constructed first, then
moved to TopMemoryContext, so that it won't leak in
TopMemoryContext if errors are encountered.
* Introduce pg_strcoll, pg_strncoll, pg_strxfrm, and pg_strnxfrm
so that varlena/hash/varchar code doesn't worry about the
details.
* Add method structure pg_icu_library, borrowed from Thomas's
patch, that provides one convenient place to provide
multiple-ICU-library support.
* Add a hook that allows you to fill in the pg_icu_library
structure however you want while a pg_locale_t is being
constructed. This allows do-it-yourself ICU library
lockdown.
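
To make the last two items concrete, here is a minimal sketch of a method structure plus a hook, written in plain C without ICU. It is not the code from the attached patches: demo_coll_library, demo_library_hook, demo_get_library and demo_pinning_hook are hypothetical names, the bytewise comparator merely stands in for real ICU entry points, and falling back to a built-in library when the hook returns NULL is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a method structure such as pg_icu_library. */
typedef struct demo_coll_library
{
    const char *name;
    int         (*compare) (const char *a, size_t alen,
                            const char *b, size_t blen);
} demo_coll_library;

/* Hypothetical hook type: may return a library, or NULL to decline. */
typedef const demo_coll_library *(*demo_library_hook_type) (const char *collname,
                                                            const char *collversion);

demo_library_hook_type demo_library_hook = NULL;

/* Placeholder comparator: bytewise order, shorter string first on a tie. */
static int
bytewise_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    int         r = memcmp(a, b, alen < blen ? alen : blen);

    if (r == 0 && alen != blen)
        r = (alen < blen) ? -1 : 1;
    return r;
}

static const demo_coll_library builtin_library = {"builtin", bytewise_compare};
static const demo_coll_library pinned_library = {"pinned", bytewise_compare};

/* Example hook: pins one (made-up) collation version to a specific library. */
const demo_coll_library *
demo_pinning_hook(const char *collname, const char *collversion)
{
    (void) collname;
    if (collversion != NULL && strcmp(collversion, "1") == 0)
        return &pinned_library;
    return NULL;                /* decline; caller uses the default */
}

/* Called while a locale object is being constructed. */
const demo_coll_library *
demo_get_library(const char *collname, const char *collversion)
{
    assert(collname != NULL);
    if (demo_library_hook != NULL)
    {
        const demo_coll_library *lib = demo_library_hook(collname, collversion);

        if (lib != NULL)
            return lib;
    }
    return &builtin_library;
}
```

Callers would then go through the returned structure (lib->compare(...)) instead of calling library functions directly, which is what gives a single place to swap libraries.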

On the negative side, it increases the line count. Part of that is
because adding indirection for the ICU library is just more lines of
code, but a lot of it is just that I used a lot of smaller functions.
Perhaps my style is a bit verbose?

Even though we're close to consensus on how we should offer control
over the ICU libraries, having the hook may be useful for
experimentation, testing, or as a last resort. Right now the hook has
limited information to use to find the right library -- just the ICU
collation name and the version, because that's what we have in the
catalog. But I assume the patch Thomas is working on will change that.

Performance:

I did brief performance sanity tests on several paths and the results
are unremarkable (which is generally good for a refactor). On the path
I was watching most closely, ICU/UTF8/en-US-x-icu, it came in about 2%
faster, which was a pleasant surprise. This was true both when I
disabled abbreviated keys (to stress localized comparison paths) and
also with abbreviated keys enabled. My previous refactoring work[1]
ended up a percent or two slower. My guess right now is that I moved
some code around after I noticed that ICU accepts NUL-terminated
strings (by specifying the length as -1), and that helped. But I'll
need to profile and look more closely to be more certain of my results;
these are preliminary.

There are a few things that could be done differently:

* I am still a bit confused about why someone would want a collation
with a different lc_collate and lc_ctype in libc; and assuming there is
a reason, why it can't be done with ICU. The way I did the refactoring
tries to accommodate them as different concepts, but I can rip that
out.

* In theory, we could also support multilib libc, and an associated
get_libc_library() and hook, but there are a couple of big challenges.
First, if it's the default locale, it relies on setlocale(), so we'd
have to figure out what to do about that. Second, having a second
version of glibc on your system is not as normal or trivial as having a
second version of ICU.

* I made the hook simple, but all it can do is replace the ICU
library. It's possible that it would want to construct its own entire
pg_locale_t for some reason, and keep more complex state in a private
pointer, or something like that. It seemed better to keep it simple,
but maybe someone would want more flexibility there?

* I used the library indirection for pretty much all ICU calls,
including the ucnv_ and the uloc_ functions. I did this mainly because,
if we are so paranoid about ICU changing in subtle ways, we might as
well make it possible to lock down everything. I can rip this out, too,
but it didn't add many lines.

Loose ends:

* I need to do something with get_collation_actual_version. I had an
earlier iteration that went through pg_newlocale() and then queried the
resulting pg_locale_t structure, but that changed the error paths in a
way that failed a couple tests, so I left that out.

* Error paths could be improved further to make sure that libc
locale_t and UCollator structures are freed in error paths during
construction. I was thinking about using resowner for this, and then if
the pg_locale_t structure gets moved to TopMemoryContext, just doing a
ResourceOwnerForget. Alternatively, I could just be careful about the
error paths.

* We'd need to adapt this and make sure it works with whatever scheme
we decide is best for finding the right library. I suspect this would
just be adding another parameter to get_icu_library (and the hook) to
represent the new collation provider Oid (and get_icu_library could use
that to look up the library names and load them).

Comments welcome.

[1]: /messages/by-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v2-0001-Add-pg_strcoll-and-pg_strncoll.patch (text/x-patch; charset=UTF-8)

From 5d144e19686a0ae01e1d80a9f8c356079f41b9eb Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:45:15 -0800
Subject: [PATCH v2 1/6] Add pg_strcoll() and pg_strncoll().

Callers with NUL-terminated strings should call the former; callers
with strings and their length should call the latter.
---
 src/backend/utils/adt/pg_locale.c | 420 ++++++++++++++++++++++++++++--
 src/backend/utils/adt/varlena.c   | 230 +---------------
 src/include/utils/pg_locale.h     |   3 +
 3 files changed, 406 insertions(+), 247 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..6cd629ecb4 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -79,6 +79,12 @@
 #include <shlwapi.h>
 #endif
 
+/*
+ * This should be large enough that most strings will fit, but small enough
+ * that we feel comfortable putting it on the stack
+ */
+#define		TEXTBUFLEN			1024
+
 #define		MAX_L10N_DATA		80
 
 
@@ -123,6 +129,19 @@ static char *IsoLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
+/*
+ * Converter object for converting between ICU's UChar strings and C strings
+ * in database encoding.  Since the database encoding doesn't change, we only
+ * need one of these per session.
+ */
+static UConverter *icu_converter = NULL;
+
+static void init_icu_converter(void);
+static size_t uchar_length(UConverter *converter,
+						   const char *str, size_t len);
+static int32_t uchar_convert(UConverter *converter,
+							 UChar *dest, int32_t destlen,
+							 const char *str, size_t srclen);
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
 
@@ -1731,15 +1750,356 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+/*
+ * pg_strncoll_libc_win32_utf8
+ *
+ * Win32 does not have UTF-8. Convert UTF8 arguments to wide characters and
+ * invoke wcscoll() or wcscoll_l().
+ */
+#ifdef WIN32
+static int
+pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
+							size_t len2, pg_locale_t locale)
+{
+	char		sbuf[TEXTBUFLEN];
+	char	   *buf = sbuf;
+	char	   *a1p,
+			   *a2p;
+	int			a1len = len1 * 2 + 2;
+	int			a2len = len2 * 2 + 2;
+	int			r;
+	int			result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	Assert(GetDatabaseEncoding() == PG_UTF8);
+#ifndef WIN32
+	Assert(false);
+#endif
+
+	if (a1len + a2len > TEXTBUFLEN)
+		buf = palloc(a1len + a2len);
+
+	a1p = buf;
+	a2p = buf + a1len;
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+								(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+								(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+	else
+#endif
+		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
+								 * headers */
+		ereport(ERROR,
+				(errmsg("could not compare Unicode strings: %m")));
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+#endif							/* WIN32 */
+
+/*
+ * pg_strcoll_libc
+ *
+ * Call strcoll(), strcoll_l(), wcscoll(), or wcscoll_l() as appropriate for
+ * the given locale, platform, and database encoding. If the locale is NULL,
+ * use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ */
+static int
+pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+#ifdef WIN32
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		size_t len1 = strlen(arg1);
+		size_t len2 = strlen(arg2);
+		result = pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+	}
+	else
+#endif							/* WIN32 */
+	if (locale)
+	{
+#ifdef HAVE_LOCALE_T
+		result = strcoll_l(arg1, arg2, locale->info.lt);
+#else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+	}
+	else
+		result = strcoll(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_libc
+ *
+ * Null-terminate the arguments and call pg_strcoll_libc().
+ */
+static int
+pg_strncoll_libc(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	  = sbuf;
+	size_t	 bufsize1 = len1 + 1;
+	size_t	 bufsize2 = len2 + 1;
+	char	*arg1n;
+	char	*arg2n;
+	int		 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef WIN32
+	/* check for this case before doing the work for nul-termination */
+	if (GetDatabaseEncoding() == PG_UTF8)
+		return pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+#endif							/* WIN32 */
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	arg1n = buf;
+	arg2n = buf + bufsize1;
+
+	/* nul-terminate arguments */
+	memcpy(arg1n, arg1, len1);
+	arg1n[len1] = '\0';
+	memcpy(arg2n, arg2, len2);
+	arg2n[len2] = '\0';
+
+	result = pg_strcoll_libc(arg1n, arg2n, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
 
 #ifdef USE_ICU
+
 /*
- * Converter object for converting between ICU's UChar strings and C strings
- * in database encoding.  Since the database encoding doesn't change, we only
- * need one of these per session.
+ * pg_strncoll_icu_no_utf8
+ *
+ * Convert the arguments from the database encoding to UChar strings, then
+ * call ucol_strcoll().
+ *
+ * When the database encoding is UTF-8, and ICU supports ucol_strcollUTF8(),
+ * caller should call that instead.
  */
-static UConverter *icu_converter = NULL;
+static int
+pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
+						const char *arg2, size_t len2, pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf = sbuf;
+	int32_t	 ulen1;
+	int32_t	 ulen2;
+	size_t   bufsize1;
+	size_t   bufsize2;
+	UChar	*uchar1,
+			*uchar2;
+	int		 result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+#endif
+
+	init_icu_converter();
+
+	ulen1 = uchar_length(icu_converter, arg1, len1);
+	ulen2 = uchar_length(icu_converter, arg2, len2);
+
+	bufsize1 = (ulen1 + 1) * sizeof(UChar);
+	bufsize2 = (ulen2 + 1) * sizeof(UChar);
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	uchar1 = (UChar *) buf;
+	uchar2 = (UChar *) (buf + bufsize1);
 
+	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+
+	result = ucol_strcoll(locale->info.icu.ucol,
+						  uchar1, ulen1,
+						  uchar2, ulen2);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_icu
+ *
+ * Call ucol_strcollUTF8() or ucol_strcoll() as appropriate for the given
+ * database encoding.
+ *
+ * Arguments must be encoded in the database encoding.
+ */
+static int
+pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				pg_locale_t locale)
+{
+	int result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status;
+
+		status = U_ZERO_ERROR;
+		result = ucol_strcollUTF8(locale->info.icu.ucol,
+								  arg1, len1,
+								  arg2, len2,
+								  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("collation failed: %s", u_errorName(status))));
+	}
+	else
+#endif
+	{
+		result = pg_strncoll_icu_no_utf8(arg1, len1, arg2, len2, locale);
+	}
+
+	return result;
+}
+
+/*
+ * pg_strcoll_icu
+ *
+ * Calculate the string lengths and call pg_strncoll_icu().
+ */
+static int
+pg_strcoll_icu(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strncoll_icu(arg1, -1, arg2, -1, locale);
+}
+
+#endif							/* USE_ICU */
+
+/*
+ * pg_strcoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ *
+ * If the collation is deterministic, break ties with strcmp().
+ */
+int
+pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int			result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strcoll_libc(arg1, arg2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strcoll_icu(arg1, arg2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+		result = strcmp(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding.
+ *
+ * If the collation is deterministic, break ties with memcmp(), and then with
+ * the string length.
+ *
+ * This function may need to nul-terminate the arguments for libc functions;
+ * so if the caller already has nul-terminated strings, it should call
+ * pg_strcoll() instead.
+ */
+int
+pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+			pg_locale_t locale)
+{
+	int		 result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strncoll_libc(arg1, len1, arg2, len2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strncoll_icu(arg1, len1, arg2, len2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+	{
+		result = memcmp(arg1, arg2, Min(len1, len2));
+		if ((result == 0) && (len1 != len2))
+			result = (len1 < len2) ? -1 : 1;
+	}
+
+	return result;
+}
+
+
+#ifdef USE_ICU
 static void
 init_icu_converter(void)
 {
@@ -1767,6 +2127,39 @@ init_icu_converter(void)
 	icu_converter = conv;
 }
 
+/*
+ * Find length, in UChars, of given string if converted to UChar string.
+ */
+static size_t
+uchar_length(UConverter *converter, const char *str, size_t len)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
+/*
+ * Convert the given source string into a UChar string, stored in dest, and
+ * return the length (in UChars).
+ */
+static int32_t
+uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
+			  const char *src, size_t srclen)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	status = U_ZERO_ERROR;
+	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
 /*
  * Convert a string in the database encoding into a string of UChars.
  *
@@ -1782,26 +2175,15 @@ init_icu_converter(void)
 int32_t
 icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
 {
-	UErrorCode	status;
-	int32_t		len_uchar;
+	int32_t len_uchar;
 
 	init_icu_converter();
 
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, NULL, 0,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_length(icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, *buff_uchar, len_uchar + 1,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_convert(icu_converter,
+							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
 }
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c5e7ee7ca2..c904bc0825 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1535,10 +1535,6 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 	}
 	else
 	{
-		char		a1buf[TEXTBUFLEN];
-		char		a2buf[TEXTBUFLEN];
-		char	   *a1p,
-				   *a2p;
 		pg_locale_t mylocale;
 
 		mylocale = pg_newlocale_from_collation(collid);
@@ -1555,171 +1551,7 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
 			return 0;
 
-#ifdef WIN32
-		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
-		if (GetDatabaseEncoding() == PG_UTF8
-			&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))
-		{
-			int			a1len;
-			int			a2len;
-			int			r;
-
-			if (len1 >= TEXTBUFLEN / 2)
-			{
-				a1len = len1 * 2 + 2;
-				a1p = palloc(a1len);
-			}
-			else
-			{
-				a1len = TEXTBUFLEN;
-				a1p = a1buf;
-			}
-			if (len2 >= TEXTBUFLEN / 2)
-			{
-				a2len = len2 * 2 + 2;
-				a2p = palloc(a2len);
-			}
-			else
-			{
-				a2len = TEXTBUFLEN;
-				a2p = a2buf;
-			}
-
-			/* stupid Microsloth API does not work for zero-length input */
-			if (len1 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
-										(LPWSTR) a1p, a1len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a1p)[r] = 0;
-
-			if (len2 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
-										(LPWSTR) a2p, a2len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a2p)[r] = 0;
-
-			errno = 0;
-#ifdef HAVE_LOCALE_T
-			if (mylocale)
-				result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, mylocale->info.lt);
-			else
-#endif
-				result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
-			if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
-										 * headers */
-				ereport(ERROR,
-						(errmsg("could not compare Unicode strings: %m")));
-
-			/* Break tie if necessary. */
-			if (result == 0 &&
-				(!mylocale || mylocale->deterministic))
-			{
-				result = memcmp(arg1, arg2, Min(len1, len2));
-				if ((result == 0) && (len1 != len2))
-					result = (len1 < len2) ? -1 : 1;
-			}
-
-			if (a1p != a1buf)
-				pfree(a1p);
-			if (a2p != a2buf)
-				pfree(a2p);
-
-			return result;
-		}
-#endif							/* WIN32 */
-
-		if (len1 >= TEXTBUFLEN)
-			a1p = (char *) palloc(len1 + 1);
-		else
-			a1p = a1buf;
-		if (len2 >= TEXTBUFLEN)
-			a2p = (char *) palloc(len2 + 1);
-		else
-			a2p = a2buf;
-
-		memcpy(a1p, arg1, len1);
-		a1p[len1] = '\0';
-		memcpy(a2p, arg2, len2);
-		a2p[len2] = '\0';
-
-		if (mylocale)
-		{
-			if (mylocale->provider == COLLPROVIDER_ICU)
-			{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UErrorCode	status;
-
-					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
-				}
-				else
-#endif
-				{
-					int32_t		ulen1,
-								ulen2;
-					UChar	   *uchar1,
-							   *uchar2;
-
-					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
-					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
-
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
-
-					pfree(uchar1);
-					pfree(uchar2);
-				}
-#else							/* not USE_ICU */
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif							/* not USE_ICU */
-			}
-			else
-			{
-#ifdef HAVE_LOCALE_T
-				result = strcoll_l(a1p, a2p, mylocale->info.lt);
-#else
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif
-			}
-		}
-		else
-			result = strcoll(a1p, a2p);
-
-		/* Break tie if necessary. */
-		if (result == 0 &&
-			(!mylocale || mylocale->deterministic))
-			result = strcmp(a1p, a2p);
-
-		if (a1p != a1buf)
-			pfree(a1p);
-		if (a2p != a2buf)
-			pfree(a2p);
+		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 	}
 
 	return result;
@@ -2377,65 +2209,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		return sss->last_returned;
 	}
 
-	if (sss->locale)
-	{
-		if (sss->locale->provider == COLLPROVIDER_ICU)
-		{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-			if (GetDatabaseEncoding() == PG_UTF8)
-			{
-				UErrorCode	status;
-
-				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
-				if (U_FAILURE(status))
-					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
-			}
-			else
-#endif
-			{
-				int32_t		ulen1,
-							ulen2;
-				UChar	   *uchar1,
-						   *uchar2;
-
-				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
-				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
-
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
-
-				pfree(uchar1);
-				pfree(uchar2);
-			}
-#else							/* not USE_ICU */
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif							/* not USE_ICU */
-		}
-		else
-		{
-#ifdef HAVE_LOCALE_T
-			result = strcoll_l(sss->buf1, sss->buf2, sss->locale->info.lt);
-#else
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif
-		}
-	}
-	else
-		result = strcoll(sss->buf1, sss->buf2);
-
-	/* Break tie if necessary. */
-	if (result == 0 &&
-		(!sss->locale || sss->locale->deterministic))
-		result = strcmp(sss->buf1, sss->buf2);
+	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
 	sss->cache_blob = false;
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index a875942123..bf70ae08ca 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -100,6 +100,9 @@ extern void make_icu_collator(const char *iculocstr,
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
+extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
+extern int pg_strncoll(const char *arg1, size_t len1,
+					   const char *arg2, size_t len2, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1
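
The core pattern in pg_strncoll_libc() and pg_strncoll() above (NUL-terminate both arguments in one stack buffer, fall back to the heap for long inputs, then break ties bytewise) can be tried outside the backend. The following is a simplified sketch, not the patch itself: demo_strncoll and DEMO_BUFLEN are hypothetical names, malloc()/free() replace palloc()/pfree(), only plain strcoll() is called, and the tie-break is applied unconditionally as if the collation were deterministic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counterpart of TEXTBUFLEN: small enough for the stack. */
#define DEMO_BUFLEN 1024

/* Compare two length-delimited strings using the current libc locale. */
int
demo_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2)
{
    char        sbuf[DEMO_BUFLEN];
    char       *buf = sbuf;
    size_t      bufsize1 = len1 + 1;
    size_t      bufsize2 = len2 + 1;
    char       *arg1n;
    char       *arg2n;
    int         result;

    assert(arg1 != NULL && arg2 != NULL);

    /* One buffer holds both copies; use the heap only if it is too big. */
    if (bufsize1 + bufsize2 > DEMO_BUFLEN)
    {
        buf = malloc(bufsize1 + bufsize2);
        if (buf == NULL)
            abort();
    }

    arg1n = buf;
    arg2n = buf + bufsize1;

    /* strcoll() needs NUL-terminated input */
    memcpy(arg1n, arg1, len1);
    arg1n[len1] = '\0';
    memcpy(arg2n, arg2, len2);
    arg2n[len2] = '\0';

    result = strcoll(arg1n, arg2n);

    /* Break ties bytewise, then by length (deterministic behavior assumed). */
    if (result == 0)
    {
        size_t      minlen = (len1 < len2) ? len1 : len2;

        result = memcmp(arg1, arg2, minlen);
        if (result == 0 && len1 != len2)
            result = (len1 < len2) ? -1 : 1;
    }

    if (buf != sbuf)
        free(buf);

    return result;
}
```

Because the lengths are explicit, bytes past len1 and len2 never influence the result, which is why callers holding non-terminated data use the "n" variant rather than copying first.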

v2-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch (text/x-patch; charset=UTF-8)

From bbee706cfa8a074c31429b178242c77084af9d6c Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:41:38 -0800
Subject: [PATCH v2 2/6] Add pg_strxfrm() and pg_strxfrm_prefix().

Callers with a NUL-terminated string should call the former; callers
with a string and length should call the latter.
---
 src/backend/access/hash/hashfunc.c |  45 ++--
 src/backend/utils/adt/pg_locale.c  | 382 +++++++++++++++++++++++++++++
 src/backend/utils/adt/varchar.c    |  41 ++--
 src/backend/utils/adt/varlena.c    | 142 +++--------
 src/include/utils/pg_locale.h      |  10 +
 5 files changed, 470 insertions(+), 150 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index f890f79ee1..b8136e496f 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -291,21 +291,19 @@ hashtext(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -349,21 +347,20 @@ hashtextextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 6cd629ecb4..133bb03a13 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2099,6 +2099,388 @@ pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 }
 
 
+static size_t
+pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
+				pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef TRUST_STXFRM
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		return strxfrm_l(dest, src, destsize, locale->info.lt);
+	else
+#endif
+		return strxfrm(dest, src, destsize);
+#else
+	/* shouldn't happen */
+	elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+}
+
+static size_t
+pg_strnxfrm_libc(char *dest, const char *src, size_t srclen, size_t destsize,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	 = sbuf;
+	size_t	 bufsize = srclen + 1;
+	size_t	 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+	if (bufsize > TEXTBUFLEN)
+		buf = palloc(bufsize);
+
+	/* nul-terminate arguments */
+	memcpy(buf, src, srclen);
+	buf[srclen] = '\0';
+
+	result = pg_strxfrm_libc(dest, buf, destsize, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_libc(char *dest, const char *src, size_t destsize,
+					   pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strxfrm_prefix()",
+		 locale->provider);
+}
+
+static size_t
+pg_strnxfrm_prefix_libc(char *dest, const char *src, size_t srclen,
+						size_t destsize, pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strnxfrm_prefix()",
+		 COLLPROVIDER_LIBC);
+}
+
+#ifdef USE_ICU
+
+static size_t
+pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
+				pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	= sbuf;
+	UChar	*uchar;
+	int32_t	 ulen;
+	size_t   uchar_bsize;
+	Size	 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
+								   uchar, ulen,
+								   (uint8_t *) dest, destsize);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result_bsize;
+}
+
+static size_t
+pg_strxfrm_icu(char *dest, const char *src, size_t destsize,
+			   pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_icu(dest, src, -1, destsize, locale);
+}
+
+static size_t
+pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
+							   size_t destsize, pg_locale_t locale)
+{
+	char			 sbuf[TEXTBUFLEN];
+	char			*buf   = sbuf;
+	UCharIterator	 iter;
+	uint32_t		 state[2];
+	UErrorCode		 status;
+	int32_t			 ulen  = -1;
+	UChar			*uchar = NULL;
+	size_t			 uchar_bsize;
+	Size			 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	uiter_setString(&iter, uchar, ulen);
+	state[0] = state[1] = 0;	/* won't need that again */
+	status = U_ZERO_ERROR;
+	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
+										&iter,
+										state,
+										(uint8_t *) dest,
+										destsize,
+										&status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("sort key generation failed: %s",
+						u_errorName(status))));
+
+	return result_bsize;
+}
+
+static size_t
+pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
+					   size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UCharIterator iter;
+		uint32_t	state[2];
+		UErrorCode	status;
+
+		uiter_setUTF8(&iter, src, srclen);
+		state[0] = state[1] = 0;	/* won't need that again */
+		status = U_ZERO_ERROR;
+		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
+									  &iter,
+									  state,
+									  (uint8_t *) dest,
+									  destsize,
+									  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("sort key generation failed: %s",
+							u_errorName(status))));
+	}
+	else
+		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
+												locale);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_icu(char *dest, const char *src, size_t destsize,
+					  pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_prefix_icu(dest, src, -1, destsize, locale);
+}
+
+#endif
+
+/*
+ * Return true if the collation provider supports pg_strxfrm() and
+ * pg_strnxfrm(); otherwise false.
+ *
+ * Unfortunately, it seems that strxfrm() for non-C collations is broken on
+ * many common platforms; testing of multiple versions of glibc reveals that,
+ * for many locales, strcoll() and strxfrm() do not return consistent
+ * results. Apart from glibc, only Cygwin's libc has so far been shown to
+ * have a problem, but we take the conservative course of action for now and
+ * disable this categorically.  (Users who are certain this isn't a problem on
+ * their system can define TRUST_STRXFRM.)
+ */
+bool
+pg_strxfrm_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+	{
+#ifdef TRUST_STRXFRM
+		return true;
+#else
+		return false;
+#endif
+	}
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ */
+size_t
+pg_strxfrm(char *dest, const char *src, size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm() instead.
+ */
+size_t
+pg_strnxfrm(char *dest, size_t destsize, const char *src, size_t srclen,
+			pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * Return true if the collation provider supports pg_strxfrm_prefix() and
+ * pg_strnxfrm_prefix(); otherwise false.
+ */
+bool
+pg_strxfrm_prefix_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return false;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ */
+size_t
+pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+				  pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_prefix_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_prefix_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' need not be nul-terminated; its length is 'srclen'.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm_prefix() instead.
+ */
+size_t
+pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+				   size_t srclen, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_prefix_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_prefix_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
 #ifdef USE_ICU
 static void
 init_icu_converter(void)
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index a63c498181..d0bc528e9f 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1019,21 +1019,17 @@ hashbpchar(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -1081,21 +1077,18 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c904bc0825..2dfba4b488 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1887,20 +1887,6 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 		 */
 		locale = pg_newlocale_from_collation(collid);
 
-		/*
-		 * There is a further exception on Windows.  When the database
-		 * encoding is UTF-8 and we are not using the C collation, complex
-		 * hacks are required.  We don't currently have a comparator that
-		 * handles that case, so we fall back on the slow method of having the
-		 * sort code invoke bttextcmp() (in the case of text) via the fmgr
-		 * trampoline.  ICU locales work just the same on Windows, however.
-		 */
-#ifdef WIN32
-		if (GetDatabaseEncoding() == PG_UTF8 &&
-			!(locale && locale->provider == COLLPROVIDER_ICU))
-			return;
-#endif
-
 		/*
 		 * We use varlenafastcmp_locale except for type NAME.
 		 */
@@ -1916,13 +1902,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 
 	/*
 	 * Unfortunately, it seems that abbreviation for non-C collations is
-	 * broken on many common platforms; testing of multiple versions of glibc
-	 * reveals that, for many locales, strcoll() and strxfrm() do not return
-	 * consistent results, which is fatal to this optimization.  While no
-	 * other libc other than Cygwin has so far been shown to have a problem,
-	 * we take the conservative course of action for right now and disable
-	 * this categorically.  (Users who are certain this isn't a problem on
-	 * their system can define TRUST_STRXFRM.)
+	 * broken on many common platforms; see pg_strxfrm_enabled().
 	 *
 	 * Even apart from the risk of broken locales, it's possible that there
 	 * are platforms where the use of abbreviated keys should be disabled at
@@ -1935,10 +1915,8 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * categorically, we may still want or need to disable it for particular
 	 * platforms.
 	 */
-#ifndef TRUST_STRXFRM
-	if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
+	if (!collate_c && !pg_strxfrm_enabled(locale))
 		abbreviate = false;
-#endif
 
 	/*
 	 * If we're using abbreviated keys, or if we're using a locale-aware
@@ -2227,6 +2205,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 static Datum
 varstr_abbrev_convert(Datum original, SortSupport ssup)
 {
+	const size_t max_prefix_bytes = sizeof(Datum);
 	VarStringSortSupport *sss = (VarStringSortSupport *) ssup->ssup_extra;
 	VarString  *authoritative = DatumGetVarStringPP(original);
 	char	   *authoritative_data = VARDATA_ANY(authoritative);
@@ -2239,7 +2218,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 
 	pres = (char *) &res;
 	/* memset(), so any non-overwritten bytes are NUL */
-	memset(pres, 0, sizeof(Datum));
+	memset(pres, 0, max_prefix_bytes);
 	len = VARSIZE_ANY_EXHDR(authoritative);
 
 	/* Get number of bytes, ignoring trailing spaces */
@@ -2274,14 +2253,10 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 	 * thing: explicitly consider string length.
 	 */
 	if (sss->collate_c)
-		memcpy(pres, authoritative_data, Min(len, sizeof(Datum)));
+		memcpy(pres, authoritative_data, Min(len, max_prefix_bytes));
 	else
 	{
 		Size		bsize;
-#ifdef USE_ICU
-		int32_t		ulen = -1;
-		UChar	   *uchar = NULL;
-#endif
 
 		/*
 		 * We're not using the C collation, so fall back on strxfrm or ICU
@@ -2299,7 +2274,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		if (sss->last_len1 == len && sss->cache_blob &&
 			memcmp(sss->buf1, authoritative_data, len) == 0)
 		{
-			memcpy(pres, sss->buf2, Min(sizeof(Datum), sss->last_len2));
+			memcpy(pres, sss->buf2, Min(max_prefix_bytes, sss->last_len2));
 			/* No change affecting cardinality, so no hashing required */
 			goto done;
 		}
@@ -2307,81 +2282,49 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		memcpy(sss->buf1, authoritative_data, len);
 
 		/*
-		 * Just like strcoll(), strxfrm() expects a NUL-terminated string. Not
-		 * necessary for ICU, but doesn't hurt.
+		 * pg_strxfrm() and pg_strxfrm_prefix() expect NUL-terminated
+		 * strings.
 		 */
 		sss->buf1[len] = '\0';
 		sss->last_len1 = len;
 
-#ifdef USE_ICU
-		/* When using ICU and not UTF8, convert string to UChar. */
-		if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU &&
-			GetDatabaseEncoding() != PG_UTF8)
-			ulen = icu_to_uchar(&uchar, sss->buf1, len);
-#endif
-
-		/*
-		 * Loop: Call strxfrm() or ucol_getSortKey(), possibly enlarge buffer,
-		 * and try again.  Both of these functions have the result buffer
-		 * content undefined if the result did not fit, so we need to retry
-		 * until everything fits, even though we only need the first few bytes
-		 * in the end.  When using ucol_nextSortKeyPart(), however, we only
-		 * ask for as many bytes as we actually need.
-		 */
-		for (;;)
+		if (pg_strxfrm_prefix_enabled(sss->locale))
 		{
-#ifdef USE_ICU
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU)
+			if (sss->buflen2 < max_prefix_bytes)
 			{
-				/*
-				 * When using UTF8, use the iteration interface so we only
-				 * need to produce as many bytes as we actually need.
-				 */
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UCharIterator iter;
-					uint32_t	state[2];
-					UErrorCode	status;
-
-					uiter_setUTF8(&iter, sss->buf1, len);
-					state[0] = state[1] = 0;	/* won't need that again */
-					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
-				}
-				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+				sss->buflen2 = Max(max_prefix_bytes,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
 			}
-			else
-#endif
-#ifdef HAVE_LOCALE_T
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_LIBC)
-				bsize = strxfrm_l(sss->buf2, sss->buf1,
-								  sss->buflen2, sss->locale->info.lt);
-			else
-#endif
-				bsize = strxfrm(sss->buf2, sss->buf1, sss->buflen2);
-
-			sss->last_len2 = bsize;
-			if (bsize < sss->buflen2)
-				break;
 
+			bsize = pg_strxfrm_prefix(sss->buf2, sss->buf1,
+									  max_prefix_bytes, sss->locale);
+		}
+		else
+		{
 			/*
-			 * Grow buffer and retry.
+			 * Loop: Call pg_strxfrm(), possibly enlarge buffer, and try
+			 * again.  The pg_strxfrm() function leaves the result buffer
+			 * content undefined if the result did not fit, so we need to
+			 * retry until everything fits, even though we only need the first
+			 * few bytes in the end.
 			 */
-			sss->buflen2 = Max(bsize + 1,
-							   Min(sss->buflen2 * 2, MaxAllocSize));
-			sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			for (;;)
+			{
+				bsize = pg_strxfrm(sss->buf2, sss->buf1, sss->buflen2,
+								   sss->locale);
+
+				sss->last_len2 = bsize;
+				if (bsize < sss->buflen2)
+					break;
+
+				/*
+				 * Grow buffer and retry.
+				 */
+				sss->buflen2 = Max(bsize + 1,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			}
 		}
 
 		/*
@@ -2393,12 +2336,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		 * (Actually, even if there were NUL bytes in the blob it would be
 		 * okay.  See remarks on bytea case above.)
 		 */
-		memcpy(pres, sss->buf2, Min(sizeof(Datum), bsize));
-
-#ifdef USE_ICU
-		if (uchar)
-			pfree(uchar);
-#endif
+		memcpy(pres, sss->buf2, Min(max_prefix_bytes, bsize));
 	}
 
 	/*
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index bf70ae08ca..ceab0d4307 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -103,6 +103,16 @@ extern char *get_collation_actual_version(char collprovider, const char *collcol
 extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
 extern int pg_strncoll(const char *arg1, size_t len1,
 					   const char *arg2, size_t len2, pg_locale_t locale);
+extern bool pg_strxfrm_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm(char *dest, const char *src, size_t destsize,
+						 pg_locale_t locale);
+extern size_t pg_strnxfrm(char *dest, size_t destsize, const char *src,
+						  size_t srclen, pg_locale_t locale);
+extern bool pg_strxfrm_prefix_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+								pg_locale_t locale);
+extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+								 size_t srclen, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1
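
The hash functions in the patch above call pg_strnxfrm() twice: once with a NULL destination and size zero to learn how many bytes are needed, and once more to fill a buffer of that size. A minimal standalone sketch of the same "size, then fill" pattern follows, using only the standard C strxfrm(). The helper names make_sort_key() and compare_via_sort_keys() are hypothetical and not part of the patch, and malloc() stands in for palloc(). One difference worth noting: strxfrm()'s return value excludes the terminating nul, so this sketch allocates one extra byte.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): return a malloc'd sort key
 * for 'src' under the current LC_COLLATE locale, or NULL on failure.
 */
static char *
make_sort_key(const char *src)
{
	size_t		needed;
	size_t		written;
	char	   *key;

	assert(src != NULL);

	/* First call: size is zero, so the destination may be NULL. */
	needed = strxfrm(NULL, src, 0);

	/* strxfrm() does not count the terminating nul; add room for it. */
	key = malloc(needed + 1);
	if (key == NULL)
		return NULL;

	/* Second call: the buffer is now large enough for the whole key. */
	written = strxfrm(key, src, needed + 1);
	if (written != needed)
	{
		/* Same kind of sanity check the patch applies to pg_strnxfrm(). */
		free(key);
		return NULL;
	}

	return key;
}

/*
 * Hypothetical helper: compare two strings through their sort keys.
 * The sign of the result matches strcoll() on the original strings.
 * Returns 0 if either key could not be built.
 */
static int
compare_via_sort_keys(const char *a, const char *b)
{
	char	   *ka = make_sort_key(a);
	char	   *kb = make_sort_key(b);
	int			cmp = 0;

	if (ka != NULL && kb != NULL)
		cmp = strcmp(ka, kb);

	free(ka);
	free(kb);
	return cmp;
}
```

A caller that only needs the first few bytes of the key, as the abbreviated-key code in varlena.c does, still has to build the full key this way, because strxfrm() leaves the buffer contents undefined when the result does not fit. That is the motivation for the separate prefix interface on providers that support it.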

v2-0003-Refactor-pg_locale_t-routines.patch (text/x-patch; charset=UTF-8)

From 0145d9050ba3feef163c1d4a13660f904167458a Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 5 Dec 2022 10:43:52 -0800
Subject: [PATCH v2 3/6] Refactor pg_locale_t routines.

  * add pg_locale_internal.h to hide pg_locale_struct
  * move info.lt into info.libc.lt to match icu
  * introduce init_default_locale()
  * introduce collation_version_default_locale()
  * introduce pg_locale_deterministic() accessor
  * make default_locale a static global in pg_locale.c
  * refactor pg_newlocale_from_collation() to use multiple static
    functions and avoid allocating in TopMemoryContext until
    necessary
  * refactor get_collation_actual_version() to use
    pg_newlocale()
---
 src/backend/access/hash/hashfunc.c     |  82 ++---
 src/backend/commands/collationcmds.c   |   1 +
 src/backend/regex/regc_pg_locale.c     |  45 +--
 src/backend/utils/adt/formatting.c     |  25 +-
 src/backend/utils/adt/like.c           |   3 +-
 src/backend/utils/adt/like_support.c   |   3 +-
 src/backend/utils/adt/pg_locale.c      | 459 +++++++++++++++++--------
 src/backend/utils/adt/varchar.c        |  62 ++--
 src/backend/utils/adt/varlena.c        |   8 +-
 src/backend/utils/init/postinit.c      |  30 +-
 src/include/utils/pg_locale.h          |  55 +--
 src/include/utils/pg_locale_internal.h |  68 ++++
 12 files changed, 508 insertions(+), 333 deletions(-)
 create mode 100644 src/include/utils/pg_locale_internal.h

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b8136e496f..6d9f014c5b 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -281,36 +281,28 @@ hashtext(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) VARDATA_ANY(key),
 						  VARSIZE_ANY_EXHDR(key));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any((uint8_t *) buf, bsize);
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any((uint8_t *) buf, bsize);
+
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -336,7 +328,7 @@ hashtextextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
 								   VARSIZE_ANY_EXHDR(key),
@@ -344,30 +336,22 @@ hashtextextended(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
+
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 81e54e0ce6..9e84da4891 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -36,6 +36,7 @@
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
 
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 02d462a659..ac05efb558 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -17,6 +17,7 @@
 
 #include "catalog/pg_collation.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 /*
  * To provide as much functionality as possible on a variety of platforms,
@@ -306,13 +307,13 @@ pg_wc_isdigit(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswdigit_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswdigit_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isdigit_l((unsigned char) c, pg_regex_locale->info.lt));
+					isdigit_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -342,13 +343,13 @@ pg_wc_isalpha(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalpha_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalpha_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalpha_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalpha_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -378,13 +379,13 @@ pg_wc_isalnum(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalnum_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalnum_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalnum_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalnum_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -423,13 +424,13 @@ pg_wc_isupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isupper_l((unsigned char) c, pg_regex_locale->info.lt));
+					isupper_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -459,13 +460,13 @@ pg_wc_islower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					islower_l((unsigned char) c, pg_regex_locale->info.lt));
+					islower_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -495,13 +496,13 @@ pg_wc_isgraph(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswgraph_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswgraph_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isgraph_l((unsigned char) c, pg_regex_locale->info.lt));
+					isgraph_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -531,13 +532,13 @@ pg_wc_isprint(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswprint_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswprint_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isprint_l((unsigned char) c, pg_regex_locale->info.lt));
+					isprint_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -567,13 +568,13 @@ pg_wc_ispunct(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswpunct_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswpunct_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					ispunct_l((unsigned char) c, pg_regex_locale->info.lt));
+					ispunct_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -603,13 +604,13 @@ pg_wc_isspace(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswspace_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswspace_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isspace_l((unsigned char) c, pg_regex_locale->info.lt));
+					isspace_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -647,13 +648,13 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return towupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
+				return toupper_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
@@ -691,13 +692,13 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return towlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
+				return tolower_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 26f498b5df..a4bc7fa5f5 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -87,6 +87,7 @@
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 /* ----------
  * Convenience macros for error handling
@@ -1611,7 +1612,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 	status = U_ZERO_ERROR;
 	len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-					mylocale->info.icu.locale, &status);
+					mylocale->ctype, &status);
 	if (status == U_BUFFER_OVERFLOW_ERROR)
 	{
 		/* try again with adjusted length */
@@ -1619,7 +1620,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 		status = U_ZERO_ERROR;
 		len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-						mylocale->info.icu.locale, &status);
+						mylocale->ctype, &status);
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
@@ -1732,7 +1733,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towlower(workspace[curr_char]);
@@ -1765,7 +1766,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+						*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_tolower((unsigned char) *p);
@@ -1854,7 +1855,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towupper(workspace[curr_char]);
@@ -1887,7 +1888,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = toupper_l((unsigned char) *p, mylocale->info.lt);
+						*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_toupper((unsigned char) *p);
@@ -1979,10 +1980,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 						else
-							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
-						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.libc.lt);
 					}
 					else
 #endif
@@ -2024,10 +2025,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+							*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 						else
-							*p = toupper_l((unsigned char) *p, mylocale->info.lt);
-						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.lt);
+							*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.libc.lt);
 					}
 					else
 #endif
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 8e671b9fab..98714a0492 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -24,6 +24,7 @@
 #include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 
 #define LIKE_TRUE						1
@@ -96,7 +97,7 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 		return pg_ascii_tolower(c);
 #ifdef HAVE_LOCALE_T
 	else if (locale)
-		return tolower_l(c, locale->info.lt);
+		return tolower_l(c, locale->info.libc.lt);
 #endif
 	else
 		return pg_tolower(c);
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 2d3aaaaf6b..28d23ac3ab 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -52,6 +52,7 @@
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/selfuncs.h"
 #include "utils/varlena.h"
 
@@ -1511,7 +1512,7 @@ pattern_char_isalpha(char c, bool is_multibyte,
 			(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
 #ifdef HAVE_LOCALE_T
 	else if (locale && locale->provider == COLLPROVIDER_LIBC)
-		return isalpha_l((unsigned char) c, locale->info.lt);
+		return isalpha_l((unsigned char) c, locale->info.libc.lt);
 #endif
 	else
 		return isalpha((unsigned char) c);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 133bb03a13..0a19845df4 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -65,6 +65,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/syscache.h"
 
 #ifdef USE_ICU
@@ -128,6 +129,11 @@ static HTAB *collation_cache = NULL;
 static char *IsoLocaleName(const char *);
 #endif
 
+/*
+ * Database default locale.
+ */
+static pg_locale_t default_locale = NULL;
+
 #ifdef USE_ICU
 /*
  * Converter object for converting between ICU's UChar strings and C strings
@@ -1333,7 +1339,7 @@ lc_collate_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1386,7 +1392,7 @@ lc_ctype_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1417,38 +1423,6 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
-struct pg_locale_struct default_locale;
-
-void
-make_icu_collator(const char *iculocstr,
-				  struct pg_locale_struct *resultp)
-{
-#ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
-
-	/* We will leak this string if the caller errors later :-( */
-	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
-	resultp->info.icu.ucol = collator;
-#else							/* not USE_ICU */
-	/* could get here if a collation was created by a build with ICU */
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif							/* not USE_ICU */
-}
-
-
 /* simple subroutine for reporting errors from newlocale() */
 #ifdef HAVE_LOCALE_T
 static void
@@ -1482,6 +1456,261 @@ report_newlocale_failure(const char *localename)
 #endif							/* HAVE_LOCALE_T */
 
 
+/*
+ * Construct a new pg_locale_t object.
+ *
+ * Passing NULL for the version is allowed; and even if it is specified, the
+ * result may or may not have an exactly matching version. Other parameters
+ * are required. Caller should pass isdefault=true if initializing
+ * default_locale; false otherwise.
+ *
+ * Structures are allocated in CurrentMemoryContext. The libc locale_t or
+ * UCollator is not allocated in any memory context, so the caller should be
+ * sure to call pg_freelocale() to close it.
+ */
+static pg_locale_t
+pg_newlocale(char provider, bool isdefault, bool deterministic,
+			 const char *collate, const char *ctype, const char *version)
+{
+	pg_locale_t result = palloc0(sizeof(struct pg_locale_struct));
+
+	/*
+	 * If COLLPROVIDER_DEFAULT, caller should use default_locale or NULL
+	 * instead.
+	 */
+	Assert(provider != COLLPROVIDER_DEFAULT);
+
+	if (provider == COLLPROVIDER_LIBC && isdefault)
+	{
+		/*
+		 * When the default locale is libc, the actual locale settings are
+		 * controlled by setlocale(), so there's nothing to do here.
+		 */
+	}
+	else if (provider == COLLPROVIDER_LIBC)
+	{
+#ifdef HAVE_LOCALE_T
+		locale_t        loc;
+
+		/* newlocale's result may be leaked if we encounter an error */
+
+		if (strcmp(collate, ctype) == 0)
+		{
+			/* Normal case where they're the same */
+			errno = 0;
+#ifndef WIN32
+			loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
+							NULL);
+#else
+			loc = _create_locale(LC_ALL, collate);
+#endif
+			if (!loc)
+				report_newlocale_failure(collate);
+		}
+		else
+		{
+#ifndef WIN32
+			/* We need two newlocale() steps */
+			locale_t	loc1;
+
+			errno = 0;
+			loc1 = newlocale(LC_COLLATE_MASK, collate, NULL);
+			if (!loc1)
+				report_newlocale_failure(collate);
+			errno = 0;
+			loc = newlocale(LC_CTYPE_MASK, ctype, loc1);
+			if (!loc)
+				report_newlocale_failure(ctype);
+#else
+
+			/*
+			 * XXX The _create_locale() API doesn't appear to support
+			 * this. Could perhaps be worked around by changing
+			 * pg_locale_t to contain two separate fields.
+			 */
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("collations with different collate and ctype values are not supported on this platform")));
+#endif
+		}
+
+		result->info.libc.lt = loc;
+#else							/* not HAVE_LOCALE_T */
+		/* platform that doesn't support locale_t */
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("collation provider LIBC is not supported on this platform")));
+#endif							/* not HAVE_LOCALE_T */
+	}
+#ifdef USE_ICU
+	else if (provider == COLLPROVIDER_ICU)
+	{
+		UCollator  *collator;
+		UErrorCode	status;
+
+		/* collator may be leaked if we encounter an error */
+
+		status = U_ZERO_ERROR;
+		collator = ucol_open(collate, &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							collate, u_errorName(status))));
+
+		if (U_ICU_VERSION_MAJOR_NUM < 54)
+			icu_set_collation_attributes(collator, collate);
+
+		result->info.icu.ucol = collator;
+	}
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", provider);
+
+	result->provider = provider;
+	result->deterministic = deterministic;
+	result->collate = pstrdup(collate);
+	result->ctype = pstrdup(ctype);
+
+	return result;
+}
+
+/*
+ * Move resources from the given locale into the given mcxt, consuming and
+ * freeing the given locale and returning the new one.
+ */
+static pg_locale_t
+pg_movelocale(MemoryContext mcxt, pg_locale_t *plocale)
+{
+	pg_locale_t locale = *plocale;
+	pg_locale_t result;
+
+	Assert(locale != default_locale);
+
+	result = MemoryContextAllocZero(mcxt, sizeof(struct pg_locale_struct));
+
+	result->provider = locale->provider;
+	result->deterministic = locale->deterministic;
+
+	if (locale->provider == COLLPROVIDER_LIBC)
+	{
+#ifdef HAVE_LOCALE_T
+		if (locale->info.libc.lt != NULL)
+		{
+			/* not in a memory context; just reassign the pointer */
+			result->info.libc.lt = locale->info.libc.lt;
+		}
+#endif
+	}
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+	{
+		/* not in a memory context; just reassign the pointer */
+		result->info.icu.ucol = locale->info.icu.ucol;
+	}
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	result->collate = MemoryContextStrdup(mcxt, locale->collate);
+	pfree(locale->collate);
+
+	result->ctype = MemoryContextStrdup(mcxt, locale->ctype);
+	pfree(locale->ctype);
+
+	pfree(locale);
+	*plocale = NULL;
+
+	return result;
+}
+
+/*
+ * Free pg_locale_t and close libc locale_t or UCollator.
+ */
+#ifdef USE_ICU
+static void
+pg_freelocale(pg_locale_t locale)
+{
+	if (!locale)
+		return;
+
+	Assert(locale != default_locale);
+
+	pfree(locale->collate);
+	pfree(locale->ctype);
+
+	if (locale->provider == COLLPROVIDER_LIBC)
+	{
+#ifdef HAVE_LOCALE_T
+		if (locale->info.libc.lt != NULL)
+		{
+#ifndef WIN32
+			freelocale(locale->info.libc.lt);
+#else
+			_free_locale(locale->info.libc.lt);
+#endif
+		}
+#endif
+	}
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+	{
+		ucol_close(locale->info.icu.ucol);
+	}
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	pfree(locale);
+}
+#endif
+
+/*
+ * Accessor so that callers don't need to include pg_locale_internal.h.
+ */
+bool
+pg_locale_deterministic(pg_locale_t locale)
+{
+	if (locale == NULL)
+		return true;
+	else
+		return locale->deterministic;
+}
+
+/*
+ * Initialize default database locale.
+ */
+void
+init_default_locale(char provider, const char *collate, const char *ctype,
+					const char *version)
+{
+	pg_locale_t temp_locale;
+	bool deterministic;
+
+	/*
+	 * Default locale is currently always deterministic.  Nondeterministic
+	 * locales currently don't support pattern matching, which would break a
+	 * lot of things if applied globally.
+	 */
+	deterministic = true;
+	temp_locale = pg_newlocale(provider, true, deterministic, collate,
+							   ctype, version);
+
+	default_locale = pg_movelocale(TopMemoryContext, &temp_locale);
+}
+
+/*
+ * Return palloc'd version string for the default locale.
+ */
+char *
+default_locale_collation_version(void)
+{
+	return get_collation_actual_version(default_locale->provider,
+										default_locale->collate);
+}
+
 /*
  * Create a locale_t from a collation OID.  Results are cached for the
  * lifetime of the backend.  Thus, do not free the result with freelocale().
@@ -1506,8 +1735,8 @@ pg_newlocale_from_collation(Oid collid)
 
 	if (collid == DEFAULT_COLLATION_OID)
 	{
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return &default_locale;
+		if (default_locale->provider == COLLPROVIDER_ICU)
+			return default_locale;
 		else
 			return (pg_locale_t) 0;
 	}
@@ -1519,107 +1748,65 @@ pg_newlocale_from_collation(Oid collid)
 		/* We haven't computed this yet in this session, so do it */
 		HeapTuple	tp;
 		Form_pg_collation collform;
-		struct pg_locale_struct result;
-		pg_locale_t resultp;
+		pg_locale_t temp_locale;
+		pg_locale_t perm_locale;
 		Datum		datum;
 		bool		isnull;
+		char	   *collate;
+		char	   *ctype;
+		char	   *collversionstr;
 
 		tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collid));
 		if (!HeapTupleIsValid(tp))
 			elog(ERROR, "cache lookup failed for collation %u", collid);
 		collform = (Form_pg_collation) GETSTRUCT(tp);
 
-		/* We'll fill in the result struct locally before allocating memory */
-		memset(&result, 0, sizeof(result));
-		result.provider = collform->collprovider;
-		result.deterministic = collform->collisdeterministic;
+		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
+								&isnull);
+		if (!isnull)
+			collversionstr = TextDatumGetCString(datum);
+		else
+			collversionstr = NULL;
 
 		if (collform->collprovider == COLLPROVIDER_LIBC)
 		{
-#ifdef HAVE_LOCALE_T
-			const char *collcollate;
-			const char *collctype pg_attribute_unused();
-			locale_t	loc;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate,
+									&isnull);
 			Assert(!isnull);
-			collcollate = TextDatumGetCString(datum);
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype, &isnull);
+			collate = TextDatumGetCString(datum);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype,
+									&isnull);
 			Assert(!isnull);
-			collctype = TextDatumGetCString(datum);
-
-			if (strcmp(collcollate, collctype) == 0)
-			{
-				/* Normal case where they're the same */
-				errno = 0;
-#ifndef WIN32
-				loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collcollate,
-								NULL);
-#else
-				loc = _create_locale(LC_ALL, collcollate);
-#endif
-				if (!loc)
-					report_newlocale_failure(collcollate);
-			}
-			else
-			{
-#ifndef WIN32
-				/* We need two newlocale() steps */
-				locale_t	loc1;
-
-				errno = 0;
-				loc1 = newlocale(LC_COLLATE_MASK, collcollate, NULL);
-				if (!loc1)
-					report_newlocale_failure(collcollate);
-				errno = 0;
-				loc = newlocale(LC_CTYPE_MASK, collctype, loc1);
-				if (!loc)
-					report_newlocale_failure(collctype);
-#else
-
-				/*
-				 * XXX The _create_locale() API doesn't appear to support
-				 * this. Could perhaps be worked around by changing
-				 * pg_locale_t to contain two separate fields.
-				 */
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("collations with different collate and ctype values are not supported on this platform")));
-#endif
-			}
-
-			result.info.lt = loc;
-#else							/* not HAVE_LOCALE_T */
-			/* platform that doesn't support locale_t */
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("collation provider LIBC is not supported on this platform")));
-#endif							/* not HAVE_LOCALE_T */
+			ctype = TextDatumGetCString(datum);
 		}
+#ifdef USE_ICU
 		else if (collform->collprovider == COLLPROVIDER_ICU)
 		{
-			const char *iculocstr;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale,
+									&isnull);
 			Assert(!isnull);
-			iculocstr = TextDatumGetCString(datum);
-			make_icu_collator(iculocstr, &result);
+			collate = TextDatumGetCString(datum);
+
+			/* for ICU, collate and ctype are both set from iculocale */
+			ctype = collate;
 		}
+#endif
+		else
+			/* shouldn't happen */
+			elog(ERROR, "unsupported collprovider: %c", collform->collprovider);
 
-		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
-								&isnull);
-		if (!isnull)
+		temp_locale = pg_newlocale(collform->collprovider, false,
+								   collform->collisdeterministic,
+								   collate, ctype, collversionstr);
+
+		ReleaseSysCache(tp);
+
+		if (collversionstr != NULL)
 		{
 			char	   *actual_versionstr;
-			char	   *collversionstr;
-
-			collversionstr = TextDatumGetCString(datum);
 
-			datum = SysCacheGetAttr(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
-			Assert(!isnull);
+			actual_versionstr = get_collation_actual_version(collform->collprovider, collate);
 
-			actual_versionstr = get_collation_actual_version(collform->collprovider,
-															 TextDatumGetCString(datum));
 			if (!actual_versionstr)
 			{
 				/*
@@ -1646,13 +1833,10 @@ pg_newlocale_from_collation(Oid collid)
 															NameStr(collform->collname)))));
 		}
 
-		ReleaseSysCache(tp);
-
-		/* We'll keep the pg_locale_t structures in TopMemoryContext */
-		resultp = MemoryContextAlloc(TopMemoryContext, sizeof(*resultp));
-		*resultp = result;
+		/* move into TopMemoryContext */
+		perm_locale = pg_movelocale(TopMemoryContext, &temp_locale);
 
-		cache_entry->locale = resultp;
+		cache_entry->locale = perm_locale;
 	}
 
 	return cache_entry->locale;
@@ -1812,7 +1996,7 @@ pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
 	errno = 0;
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
 	else
 #endif
 		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
@@ -1855,7 +2039,7 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	if (locale)
 	{
 #ifdef HAVE_LOCALE_T
-		result = strcoll_l(arg1, arg2, locale->info.lt);
+		result = strcoll_l(arg1, arg2, locale->info.libc.lt);
 #else
 		/* shouldn't happen */
 		elog(ERROR, "unsupported collprovider: %c", locale->provider);
@@ -2108,7 +2292,7 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 #ifdef TRUST_STXFRM
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		return strxfrm_l(dest, src, destsize, locale->info.lt);
+		return strxfrm_l(dest, src, destsize, locale->info.libc.lt);
 	else
 #endif
 		return strxfrm(dest, src, destsize);
@@ -2718,19 +2902,16 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
+	pg_locale_t locale;
 
-	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+	/*
+	 * Whether it's deterministic doesn't matter in this case, because it
+	 * doesn't affect whether the locale is valid or not; and we're going to
+	 * discard the locale anyway.
+	 */
+	locale = pg_newlocale(COLLPROVIDER_ICU, false, true, icu_locale,
+						  icu_locale, NULL);
+	pg_freelocale(locale);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -2793,10 +2974,10 @@ wchar2char(char *to, const wchar_t *from, size_t tolen, pg_locale_t locale)
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_WCSTOMBS_L
 		/* Use wcstombs_l for nondefault locales */
-		result = wcstombs_l(to, from, tolen, locale->info.lt);
+		result = wcstombs_l(to, from, tolen, locale->info.libc.lt);
 #else							/* !HAVE_WCSTOMBS_L */
 		/* We have to temporarily set the locale as current ... ugh */
-		locale_t	save_locale = uselocale(locale->info.lt);
+		locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 		result = wcstombs(to, from, tolen);
 
@@ -2870,10 +3051,10 @@ char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen,
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_MBSTOWCS_L
 			/* Use mbstowcs_l for nondefault locales */
-			result = mbstowcs_l(to, str, tolen, locale->info.lt);
+			result = mbstowcs_l(to, str, tolen, locale->info.libc.lt);
 #else							/* !HAVE_MBSTOWCS_L */
 			/* We have to temporarily set the locale as current ... ugh */
-			locale_t	save_locale = uselocale(locale->info.lt);
+			locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 			result = mbstowcs(to, str, tolen);
 
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index d0bc528e9f..52f27d483d 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -757,7 +757,7 @@ bpchareq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -802,7 +802,7 @@ bpcharne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -1010,33 +1010,25 @@ hashbpchar(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) keydata, keylen);
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any((uint8_t *) buf, bsize);
+		result = hash_any((uint8_t *) buf, bsize);
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -1067,35 +1059,27 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) keydata, keylen,
 								   PG_GETARG_INT64(1));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 2dfba4b488..a7c39d7afa 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1203,7 +1203,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -1601,7 +1601,7 @@ texteq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1660,7 +1660,7 @@ textne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1774,7 +1774,7 @@ text_starts_with(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a990c833c5..c5528cbf64 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -317,6 +317,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	char	   *collate;
 	char	   *ctype;
 	char	   *iculocale;
+	char	   *collversionstr;
 
 	/* Fetch our pg_database row normally, via syscache */
 	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
@@ -424,35 +425,32 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 		datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_daticulocale, &isnull);
 		Assert(!isnull);
 		iculocale = TextDatumGetCString(datum);
-		make_icu_collator(iculocale, &default_locale);
 	}
 	else
 		iculocale = NULL;
 
-	default_locale.provider = dbform->datlocprovider;
+	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
+							&isnull);
+	if (!isnull)
+		collversionstr = TextDatumGetCString(datum);
+	else
+		collversionstr = NULL;
 
-	/*
-	 * Default locale is currently always deterministic.  Nondeterministic
-	 * locales currently don't support pattern matching, which would break a
-	 * lot of things if applied globally.
-	 */
-	default_locale.deterministic = true;
+	init_default_locale(dbform->datlocprovider,
+						dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate,
+						dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : ctype,
+						collversionstr);
 
 	/*
 	 * Check collation version.  See similar code in
 	 * pg_newlocale_from_collation().  Note that here we warn instead of error
 	 * in any case, so that we don't prevent connecting.
 	 */
-	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
-							&isnull);
-	if (!isnull)
+	if (collversionstr != NULL)
 	{
 		char	   *actual_versionstr;
-		char	   *collversionstr;
 
-		collversionstr = TextDatumGetCString(datum);
-
-		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
+		actual_versionstr = default_locale_collation_version();
 		if (!actual_versionstr)
 			/* should not happen */
 			elog(WARNING,
@@ -470,6 +468,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 							 "or build PostgreSQL with the right library version.",
 							 quote_identifier(name))));
 	}
+	else
+		collversionstr = NULL;
 
 	/* Make the locale settings visible as GUC variables, too */
 	SetConfigOption("lc_collate", collate, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index ceab0d4307..0d7bc0534f 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -15,22 +15,6 @@
 #if defined(LOCALE_T_IN_XLOCALE) || defined(WCSTOMBS_L_IN_XLOCALE)
 #include <xlocale.h>
 #endif
-#ifdef USE_ICU
-#include <unicode/ucol.h>
-#endif
-
-#ifdef USE_ICU
-/*
- * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
- * (see
- * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
- */
-#if U_ICU_VERSION_MAJOR_NUM >= 53
-#define HAVE_UCOL_STRCOLLUTF8 1
-#else
-#undef HAVE_UCOL_STRCOLLUTF8
-#endif
-#endif
 
 /* use for libc locale names */
 #define LOCALE_NAME_BUFLEN 128
@@ -64,39 +48,12 @@ extern struct lconv *PGLC_localeconv(void);
 extern void cache_locale_time(void);
 
 
-/*
- * We define our own wrapper around locale_t so we can keep the same
- * function signatures for all builds, while not having to create a
- * fake version of the standard type locale_t in the global namespace.
- * pg_locale_t is occasionally checked for truth, so make it a pointer.
- */
-struct pg_locale_struct
-{
-	char		provider;
-	bool		deterministic;
-	union
-	{
-#ifdef HAVE_LOCALE_T
-		locale_t	lt;
-#endif
-#ifdef USE_ICU
-		struct
-		{
-			const char *locale;
-			UCollator  *ucol;
-		}			icu;
-#endif
-		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
-	}			info;
-};
-
 typedef struct pg_locale_struct *pg_locale_t;
 
-extern PGDLLIMPORT struct pg_locale_struct default_locale;
-
-extern void make_icu_collator(const char *iculocstr,
-							  struct pg_locale_struct *resultp);
-
+extern void init_default_locale(char provider, const char *collate,
+								const char *ctype, const char *version);
+extern char *default_locale_collation_version(void);
+extern bool pg_locale_deterministic(pg_locale_t locale);
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
@@ -114,10 +71,6 @@ extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
 extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 								 size_t srclen, pg_locale_t locale);
 
-#ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
 extern void check_icu_locale(const char *icu_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
new file mode 100644
index 0000000000..33465ad92d
--- /dev/null
+++ b/src/include/utils/pg_locale_internal.h
@@ -0,0 +1,68 @@
+/*-----------------------------------------------------------------------
+ *
+ * PostgreSQL locale utilities
+ *
+ * src/include/utils/pg_locale_internal.h
+ *
+ * Copyright (c) 2002-2022, PostgreSQL Global Development Group
+ *
+ *-----------------------------------------------------------------------
+ */
+
+
+#ifndef _PG_LOCALE_INTERNAL_
+#define _PG_LOCALE_INTERNAL_
+
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
+
+#ifdef USE_ICU
+/*
+ * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
+ * (see
+ * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
+ */
+#if U_ICU_VERSION_MAJOR_NUM >= 53
+#define HAVE_UCOL_STRCOLLUTF8 1
+#else
+#undef HAVE_UCOL_STRCOLLUTF8
+#endif
+#endif
+
+/*
+ * We define our own wrapper around locale_t so we can keep the same
+ * function signatures for all builds, while not having to create a
+ * fake version of the standard type locale_t in the global namespace.
+ * pg_locale_t is occasionally checked for truth, so make it a pointer.
+ */
+struct pg_locale_struct
+{
+	char		provider;
+	bool		deterministic;
+	char	   *collate;
+	char	   *ctype;
+	union
+	{
+#ifdef HAVE_LOCALE_T
+		struct
+		{
+			locale_t	lt;
+		}			libc;
+#endif
+#ifdef USE_ICU
+		struct
+		{
+			UCollator	*ucol;
+		}			icu;
+#endif
+		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
+	}			info;
+};
+
+#ifdef USE_ICU
+extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+#endif
+
+#endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1
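
The header split in the patch above (struct definition moved out of pg_locale.h into pg_locale_internal.h, with pg_locale_deterministic() added as an accessor) follows the usual opaque-pointer pattern. Below is a minimal standalone sketch of that pattern, not PostgreSQL code: every sketch_* name is hypothetical, malloc stands in for palloc, and both "headers" are shown in one file for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* ---- what a public header would expose (compare pg_locale.h) ---- */

/* Callers only ever see a pointer to an incomplete type. */
typedef struct sketch_locale_struct *sketch_locale_t;

sketch_locale_t sketch_locale_create(char provider, bool deterministic,
									 const char *collate);
bool		sketch_locale_deterministic(sketch_locale_t locale);
void		sketch_locale_free(sketch_locale_t locale);

/* ---- what an internal header would define (compare pg_locale_internal.h) ---- */

struct sketch_locale_struct
{
	char		provider;
	bool		deterministic;
	char	   *collate;
};

/* ---- implementation, the only code that needs the full definition ---- */

sketch_locale_t
sketch_locale_create(char provider, bool deterministic, const char *collate)
{
	sketch_locale_t result;
	size_t		len;

	assert(collate != NULL);

	result = malloc(sizeof(*result));
	if (result == NULL)
		return NULL;

	/* Keep a private copy of the locale name. */
	len = strlen(collate) + 1;
	result->collate = malloc(len);
	if (result->collate == NULL)
	{
		free(result);
		return NULL;
	}
	memcpy(result->collate, collate, len);

	result->provider = provider;
	result->deterministic = deterministic;
	return result;
}

/* Accessor: lets callers query a field without seeing the struct layout. */
bool
sketch_locale_deterministic(sketch_locale_t locale)
{
	return locale->deterministic;
}

void
sketch_locale_free(sketch_locale_t locale)
{
	if (locale == NULL)
		return;
	free(locale->collate);
	free(locale);
}
```

With this arrangement, code that includes only the public declarations cannot depend on the struct layout or on ICU types, which is presumably why the ICU includes could be dropped from pg_locale.h.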

v2-0004-Add-method-structure-toward-ICU-multi-library-sup.patch (text/x-patch; charset=UTF-8)

From daef4c5f6ebf6cd73dcae5402feb5ec269a50650 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 7 Dec 2022 11:07:31 -0800
Subject: [PATCH v2 4/6] Add method structure, toward ICU multi-library
 support.

Introduce structure pg_icu_library, which holds pointers to each
required ICU method, and store this as part of pg_locale_t. Each call
to an ICU function instead goes through this structure, so that it can
more easily be replaced by a non-builtin ICU library.

This is a step toward support for multiple ICU libraries, to allow
control over precisely which version of a library is used and prevent
problems from subtle changes in collation order.

Author: Thomas Munro, Jeff Davis
---
 src/backend/commands/collationcmds.c   |  17 +-
 src/backend/utils/adt/formatting.c     |  67 ++++++--
 src/backend/utils/adt/pg_locale.c      | 227 +++++++++++++++++--------
 src/include/utils/pg_locale_internal.h | 113 +++++++++++-
 4 files changed, 322 insertions(+), 102 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 9e84da4891..16b1bcbdc0 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -560,13 +560,14 @@ get_icu_language_tag(const char *localename)
 {
 	char		buf[ULOC_FULLNAME_CAPACITY];
 	UErrorCode	status;
+	pg_icu_library *iculib = get_builtin_icu_library();
 
 	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
+	iculib->toLanguageTag(localename, buf, sizeof(buf), true, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
+						localename, iculib->errorName(status))));
 
 	return pstrdup(buf);
 }
@@ -585,11 +586,12 @@ get_icu_locale_comment(const char *localename)
 	int32		len_uchar;
 	int32		i;
 	char	   *result;
+	pg_icu_library *iculib = get_builtin_icu_library();
 
 	status = U_ZERO_ERROR;
-	len_uchar = uloc_getDisplayName(localename, "en",
-									displayname, lengthof(displayname),
-									&status);
+	len_uchar = iculib->getDisplayName(localename, "en",
+									   displayname, lengthof(displayname),
+									   &status);
 	if (U_FAILURE(status))
 		return NULL;			/* no good reason to raise an error */
 
@@ -809,12 +811,13 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 	{
 		int			i;
+		pg_icu_library *iculib = get_builtin_icu_library();
 
 		/*
 		 * Start the loop at -1 to sneak in the root locale without too much
 		 * code duplication.
 		 */
-		for (i = -1; i < uloc_countAvailable(); i++)
+		for (i = -1; i < iculib->countAvailable(); i++)
 		{
 			const char *name;
 			char	   *langtag;
@@ -825,7 +828,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			if (i == -1)
 				name = "";		/* ICU root locale */
 			else
-				name = uloc_getAvailable(i);
+				name = iculib->getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
 			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index a4bc7fa5f5..289aa569de 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1600,6 +1600,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1607,6 +1612,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 {
 	UErrorCode	status;
 	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
 	len_dest = len_source;		/* try first with same length */
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
@@ -1624,18 +1630,42 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->ctype, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->ctype, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1701,11 +1731,12 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			int32_t		len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToLower, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1823,11 +1854,12 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToUpper, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1946,11 +1978,12 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case_bi(iculib->strToTitle, mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 0a19845df4..4daff8b7b5 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -70,6 +70,8 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -135,6 +137,7 @@ static char *IsoLocaleName(const char *);
 static pg_locale_t default_locale = NULL;
 
 #ifdef USE_ICU
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -142,13 +145,17 @@ static pg_locale_t default_locale = NULL;
  */
 static UConverter *icu_converter = NULL;
 
-static void init_icu_converter(void);
-static size_t uchar_length(UConverter *converter,
+static void init_icu_converter(pg_icu_library *iculib);
+static size_t uchar_length(pg_icu_library *iculib,
+						   UConverter *converter,
 						   const char *str, size_t len);
-static int32_t uchar_convert(UConverter *converter,
+static int32_t uchar_convert(pg_icu_library *iculib,
+							 UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, size_t srclen);
-static void icu_set_collation_attributes(UCollator *collator, const char *loc);
+static void icu_set_collation_attributes(pg_icu_library *iculib,
+										 UCollator *collator,
+										 const char *loc);
 #endif
 
 /*
@@ -1455,6 +1462,59 @@ report_newlocale_failure(const char *localename)
 }
 #endif							/* HAVE_LOCALE_T */
 
+#ifdef USE_ICU
+pg_icu_library *
+get_builtin_icu_library(void)
+{
+	pg_icu_library *lib;
+
+	/*
+	 * These assignments will fail to compile if an incompatible API change is
+	 * made to some future version of ICU, at which point we might need to
+	 * consider special treatment for different major version ranges, with
+	 * intermediate trampoline functions.
+	 */
+	lib = palloc0(sizeof(*lib));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = ucol_open;
+	lib->closeCollator = ucol_close;
+	lib->getCollatorVersion = ucol_getVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = ucol_strcoll;
+	lib->strcollUTF8 = ucol_strcollUTF8;
+	lib->getSortKey = ucol_getSortKey;
+	lib->nextSortKeyPart = ucol_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = u_strToUpper;
+	lib->strToLower = u_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = ucol_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	/*
+	 * Also assert the size of a couple of types used as output buffers, as a
+	 * canary to tell us to add extra padding in the (unlikely) event that a
+	 * later release makes these values smaller.
+	 */
+	StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+					 "u_versionToString output buffer size changed incompatibly");
+	StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+					 "ucol_getVersion output buffer size changed incompatibly");
+
+	return lib;
+}
+#endif
 
 /*
  * Construct a new pg_locale_t object.
@@ -1547,20 +1607,22 @@ pg_newlocale(char provider, bool isdefault, bool deterministic,
 	{
 		UCollator  *collator;
 		UErrorCode	status;
+		pg_icu_library *iculib = get_builtin_icu_library();
 
 		/* collator may be leaked if we encounter an error */
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collate, &status);
+		collator = iculib->openCollator(collate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collate, u_errorName(status))));
+							collate, iculib->errorName(status))));
 
 		if (U_ICU_VERSION_MAJOR_NUM < 54)
-			icu_set_collation_attributes(collator, collate);
+			icu_set_collation_attributes(iculib, collator, collate);
 
 		result->info.icu.ucol = collator;
+		result->info.icu.lib = iculib;
 	}
 #endif
 	else
@@ -1607,6 +1669,10 @@ pg_movelocale(MemoryContext mcxt, pg_locale_t *plocale)
 	{
 		/* not in a memory context; just reassign the pointer */
 		result->info.icu.ucol = locale->info.icu.ucol;
+
+		result->info.icu.lib = MemoryContextAlloc(mcxt, sizeof(pg_icu_library));
+		memcpy(result->info.icu.lib, locale->info.icu.lib, sizeof(pg_icu_library));
+		pfree(locale->info.icu.lib);
 	}
 #endif
 	else
@@ -1656,7 +1722,9 @@ pg_freelocale(pg_locale_t locale)
 #ifdef USE_ICU
 	else if (locale->provider == COLLPROVIDER_ICU)
 	{
-		ucol_close(locale->info.icu.ucol);
+		pg_icu_library *iculib = PG_ICU_LIB(locale);
+		iculib->closeCollator(locale->info.icu.ucol);
+		pfree(locale->info.icu.lib);
 	}
 #endif
 	else
@@ -1858,17 +1926,18 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
+		pg_icu_library	*iculib = get_builtin_icu_library();
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = iculib->openCollator(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, iculib->errorName(status))));
+		iculib->getCollatorVersion(collator, versioninfo);
+		iculib->closeCollator(collator);
 
-		u_versionToString(versioninfo, buf);
+		iculib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -2120,16 +2189,17 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	UChar	*uchar1,
 			*uchar2;
 	int		 result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 #ifdef HAVE_UCOL_STRCOLLUTF8
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 #endif
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen1 = uchar_length(icu_converter, arg1, len1);
-	ulen2 = uchar_length(icu_converter, arg2, len2);
+	ulen1 = uchar_length(iculib, icu_converter, arg1, len1);
+	ulen2 = uchar_length(iculib, icu_converter, arg2, len2);
 
 	bufsize1 = (ulen1 + 1) * sizeof(UChar);
 	bufsize2 = (ulen2 + 1) * sizeof(UChar);
@@ -2140,12 +2210,12 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	uchar1 = (UChar *) buf;
 	uchar2 = (UChar *) (buf + bufsize1);
 
-	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
-	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+	ulen1 = uchar_convert(iculib, icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(iculib, icu_converter, uchar2, ulen2 + 1, arg2, len2);
 
-	result = ucol_strcoll(locale->info.icu.ucol,
-						  uchar1, ulen1,
-						  uchar2, ulen2);
+	result = iculib->strcoll(locale->info.icu.ucol,
+							 uchar1, ulen1,
+							 uchar2, ulen2);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2166,6 +2236,7 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 				pg_locale_t locale)
 {
 	int result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
@@ -2175,13 +2246,14 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 		UErrorCode	status;
 
 		status = U_ZERO_ERROR;
-		result = ucol_strcollUTF8(locale->info.icu.ucol,
-								  arg1, len1,
-								  arg2, len2,
-								  &status);
+		result = iculib->strcollUTF8(locale->info.icu.ucol,
+									 arg1, len1,
+									 arg2, len2,
+									 &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
-					(errmsg("collation failed: %s", u_errorName(status))));
+					(errmsg("collation failed: %s",
+							iculib->errorName(status))));
 	}
 	else
 #endif
@@ -2360,12 +2432,13 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 	int32_t	 ulen;
 	size_t   uchar_bsize;
 	Size	 result_bsize;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2374,11 +2447,11 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
-	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
-								   uchar, ulen,
-								   (uint8_t *) dest, destsize);
+	result_bsize = iculib->getSortKey(locale->info.icu.ucol,
+									  uchar, ulen,
+									  (uint8_t *) dest, destsize);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2407,13 +2480,14 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 	UChar			*uchar = NULL;
 	size_t			 uchar_bsize;
 	Size			 result_bsize;
+	pg_icu_library	*iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2422,21 +2496,19 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
 	uiter_setString(&iter, uchar, ulen);
 	state[0] = state[1] = 0;	/* won't need that again */
 	status = U_ZERO_ERROR;
-	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
-										&iter,
-										state,
-										(uint8_t *) dest,
-										destsize,
-										&status);
+	result_bsize = iculib->nextSortKeyPart(
+		locale->info.icu.ucol, &iter, state,
+		(uint8_t *) dest, destsize, &status);
+
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("sort key generation failed: %s",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return result_bsize;
 }
@@ -2445,6 +2517,7 @@ static size_t
 pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 					   size_t destsize, pg_locale_t locale)
 {
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 	size_t result;
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
@@ -2455,19 +2528,17 @@ pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 		uint32_t	state[2];
 		UErrorCode	status;
 
-		uiter_setUTF8(&iter, src, srclen);
+		iculib->setUTF8(&iter, src, srclen);
 		state[0] = state[1] = 0;	/* won't need that again */
 		status = U_ZERO_ERROR;
-		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
-									  &iter,
-									  state,
-									  (uint8_t *) dest,
-									  destsize,
-									  &status);
+		result = iculib->nextSortKeyPart(
+			locale->info.icu.ucol, &iter, state,
+			(uint8_t *) dest, destsize, &status);
+
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("sort key generation failed: %s",
-							u_errorName(status))));
+							iculib->errorName(status))));
 	}
 	else
 		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
@@ -2667,7 +2738,7 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 
 #ifdef USE_ICU
 static void
-init_icu_converter(void)
+init_icu_converter(pg_icu_library *iculib)
 {
 	const char *icu_encoding_name;
 	UErrorCode	status;
@@ -2684,11 +2755,11 @@ init_icu_converter(void)
 						pg_encoding_to_char(GetDatabaseEncoding()))));
 
 	status = U_ZERO_ERROR;
-	conv = ucnv_open(icu_encoding_name, &status);
+	conv = iculib->openConverter(icu_encoding_name, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open ICU converter for encoding \"%s\": %s",
-						icu_encoding_name, u_errorName(status))));
+						icu_encoding_name, iculib->errorName(status))));
 
 	icu_converter = conv;
 }
@@ -2697,14 +2768,15 @@ init_icu_converter(void)
  * Find length, in UChars, of given string if converted to UChar string.
  */
 static size_t
-uchar_length(UConverter *converter, const char *str, size_t len)
+uchar_length(pg_icu_library *iculib, UConverter *converter, const char *str, size_t len)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
-	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	ulen = iculib->toUChars(converter, NULL, 0, str, len, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2713,16 +2785,17 @@ uchar_length(UConverter *converter, const char *str, size_t len)
  * return the length (in UChars).
  */
 static int32_t
-uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
-			  const char *src, size_t srclen)
+uchar_convert(pg_icu_library *iculib, UConverter *converter, UChar *dest,
+			  int32_t destlen, const char *src, size_t srclen)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
 	status = U_ZERO_ERROR;
-	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	ulen = iculib->toUChars(converter, dest, destlen, src, srclen, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2739,16 +2812,17 @@ uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
  * result length instead.
  */
 int32_t
-icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
+icu_to_uchar(pg_icu_library *iculib, UChar **buff_uchar, const char *buff,
+			 size_t nbytes)
 {
 	int32_t len_uchar;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	len_uchar = uchar_length(icu_converter, buff, nbytes);
+	len_uchar = uchar_length(iculib, icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-	len_uchar = uchar_convert(icu_converter,
+	len_uchar = uchar_convert(iculib, icu_converter,
 							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
@@ -2766,30 +2840,32 @@ icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
  * The result string is nul-terminated.
  */
 int32_t
-icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
+icu_from_uchar(pg_icu_library *iculib, char **result, const UChar *buff_uchar,
+			   int32_t len_uchar)
 {
 	UErrorCode	status;
 	int32_t		len_result;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, NULL, 0,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, NULL, 0,
+									buff_uchar, len_uchar, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	*result = palloc(len_result + 1);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, *result, len_result + 1,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, *result,
+									len_result + 1, buff_uchar,
+									len_uchar, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return len_result;
 }
@@ -2805,7 +2881,8 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  */
 pg_attribute_unused()
 static void
-icu_set_collation_attributes(UCollator *collator, const char *loc)
+icu_set_collation_attributes(pg_icu_library *iculib, UCollator *collator,
+							 const char *loc)
 {
 	char	   *str = asc_tolower(loc, strlen(loc));
 
@@ -2879,7 +2956,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 				status = U_ILLEGAL_ARGUMENT_ERROR;
 
 			if (status == U_ZERO_ERROR)
-				ucol_setAttribute(collator, uattr, uvalue, &status);
+				iculib->setAttribute(collator, uattr, uvalue, &status);
 
 			/*
 			 * Pretend the error came from ucol_open(), for consistent error
@@ -2888,7 +2965,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			if (U_FAILURE(status))
 				ereport(ERROR,
 						(errmsg("could not open collator for locale \"%s\": %s",
-								loc, u_errorName(status))));
+								loc, iculib->errorName(status))));
 		}
 	}
 }
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
index 33465ad92d..54445d8b87 100644
--- a/src/include/utils/pg_locale_internal.h
+++ b/src/include/utils/pg_locale_internal.h
@@ -14,6 +14,8 @@
 #define _PG_LOCALE_INTERNAL_
 
 #ifdef USE_ICU
+#include <unicode/ubrk.h>
+#include <unicode/ucnv.h>
 #include <unicode/ucol.h>
 #endif
 
@@ -30,6 +32,104 @@
 #endif
 #endif
 
+#ifdef USE_ICU
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	int			minor_version;
+	void		(*getICUVersion) (UVersionInfo info);
+	void		(*getUnicodeVersion) (UVersionInfo info);
+	void		(*getCLDRVersion) (UVersionInfo info, UErrorCode *status);
+	UCollator  *(*openCollator) (const char *loc, UErrorCode *status);
+	void		(*closeCollator) (UCollator *coll);
+	void		(*getCollatorVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*getUCAVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+	UCollationResult (*strcoll) (const UCollator *coll,
+								 const UChar *source,
+								 int32_t sourceLength,
+								 const UChar *target,
+								 int32_t targetLength);
+	UCollationResult (*strcollUTF8) (const UCollator *coll,
+									 const char *source,
+									 int32_t sourceLength,
+									 const char *target,
+									 int32_t targetLength,
+									 UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	void		(*setAttribute) (UCollator *coll,
+								 UColAttribute attr,
+								 UColAttributeValue value,
+								 UErrorCode *status);
+	UConverter *(*openConverter) (const char *converterName,
+								  UErrorCode *err);
+	void		(*closeConverter) (UConverter *converter);
+	int32_t		(*fromUChars) (UConverter *cnv,
+							   char *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UErrorCode *pErrorCode);
+	int32_t		(*toUChars) (UConverter *cnv,
+							 UChar *dest,
+							 int32_t destCapacity,
+							 const char *src,
+							 int32_t srcLength,
+							 UErrorCode *pErrorCode);
+	int32_t		(*toLanguageTag) (const char *localeID,
+								  char *langtag,
+								  int32_t langtagCapacity,
+								  UBool strict,
+								  UErrorCode *err);
+	int32_t		(*getDisplayName) (const char *localeID,
+								   const char *inLocaleID,
+								   UChar *result,
+								   int32_t maxResultSize,
+								   UErrorCode *err);
+	int32_t		(*countAvailable) (void);
+	const char *(*getAvailable) (int32_t n);
+} pg_icu_library;
+
+#endif
+
 /*
  * We define our own wrapper around locale_t so we can keep the same
  * function signatures for all builds, while not having to create a
@@ -53,7 +153,8 @@ struct pg_locale_struct
 #ifdef USE_ICU
 		struct
 		{
-			UCollator	*ucol;
+			UCollator		*ucol;
+			pg_icu_library	*lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
@@ -61,8 +162,14 @@ struct pg_locale_struct
 };
 
 #ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+#define PG_ICU_COL(x) ((x)->info.icu.ucol)
+
+extern pg_icu_library *get_builtin_icu_library(void);
+extern int32_t icu_to_uchar(pg_icu_library *lib, UChar **buff_uchar,
+							const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(pg_icu_library *lib, char **result,
+							  const UChar *buff_uchar, int32_t len_uchar);
 #endif
 
 #endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1

v2-0005-Add-get_icu_library_hook.patch (text/x-patch; charset=UTF-8)

From 7dca88aa9f51e7674151bb90506b5305f9107bac Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 5 Dec 2022 16:14:18 -0800
Subject: [PATCH v2 5/6] Add get_icu_library_hook.

Controls how ICU library symbols are loaded.
---
 src/backend/commands/collationcmds.c   |  6 +--
 src/backend/utils/adt/pg_locale.c      | 56 ++++++++++++++++++++++----
 src/include/utils/pg_locale_internal.h | 10 ++++-
 3 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 16b1bcbdc0..f439e832da 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -560,7 +560,7 @@ get_icu_language_tag(const char *localename)
 {
 	char		buf[ULOC_FULLNAME_CAPACITY];
 	UErrorCode	status;
-	pg_icu_library *iculib = get_builtin_icu_library();
+	pg_icu_library *iculib = get_icu_library(NULL, NULL, NULL);
 
 	status = U_ZERO_ERROR;
 	iculib->toLanguageTag(localename, buf, sizeof(buf), true, &status);
@@ -586,7 +586,7 @@ get_icu_locale_comment(const char *localename)
 	int32		len_uchar;
 	int32		i;
 	char	   *result;
-	pg_icu_library *iculib = get_builtin_icu_library();
+	pg_icu_library *iculib = get_icu_library(NULL, NULL, NULL);
 
 	status = U_ZERO_ERROR;
 	len_uchar = iculib->getDisplayName(localename, "en",
@@ -811,7 +811,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 	{
 		int			i;
-		pg_icu_library *iculib = get_builtin_icu_library();
+		pg_icu_library *iculib = get_icu_library(NULL, NULL, NULL);
 
 		/*
 		 * Start the loop at -1 to sneak in the root locale without too much
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 4daff8b7b5..b75a825df6 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -109,6 +109,34 @@ char	   *localized_full_days[7 + 1];
 char	   *localized_abbrev_months[12 + 1];
 char	   *localized_full_months[12 + 1];
 
+/*
+ * get_icu_library_hook can be set to control how the pg_icu_library is
+ * constructed inside a pg_locale_t structure, and therefore which specific
+ * ICU symbols are called.
+ *
+ * Without the hook, Postgres constructs the pg_icu_library from the version
+ * of ICU that Postgres is linked against at build time.
+ *
+ * The hook can instead load the ICU symbols from a different version of the
+ * ICU library on the system, which can avoid problems when the collation
+ * subtly changes across different versions of ICU.
+ *
+ * If the hook returns true, it indicates that it has successfully filled in
+ * the pg_icu_library structure that was passed in. If it returns false,
+ * Postgres will fill in the structure itself. The version of the collation
+ * returned does not need to match exactly the version that was passed in;
+ * though if not, Postgres will issue a WARNING.
+ *
+ * XXX: For now, the only information the hook has access to is the ICU
+ * collation name, ICU ctype, and the ICU version string that was obtained at
+ * the time the collation was created (or when it was last refreshed). We
+ * should consider what other information can be provided to allow for greater
+ * control over which library is loaded.
+ */
+#ifdef USE_ICU
+get_icu_library_hook_type get_icu_library_hook = NULL;
+#endif
+
 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
 static bool CurrentLCTimeValid = false;
@@ -1463,18 +1491,15 @@ report_newlocale_failure(const char *localename)
 #endif							/* HAVE_LOCALE_T */
 
 #ifdef USE_ICU
-pg_icu_library *
-get_builtin_icu_library()
+static bool
+get_builtin_icu_library(pg_icu_library *lib)
 {
-	pg_icu_library *lib;
-
 	/*
 	 * These assignments will fail to compile if an incompatible API change is
 	 * made to some future version of ICU, at which point we might need to
 	 * consider special treatment for different major version ranges, with
 	 * intermediate trampoline functions.
 	 */
-	lib = palloc0(sizeof(*lib));
 	lib->getICUVersion = u_getVersion;
 	lib->getUnicodeVersion = u_getUnicodeVersion;
 	lib->getCLDRVersion = ulocdata_getCLDRVersion;
@@ -1512,8 +1537,25 @@ get_builtin_icu_library()
 	StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
 					 "ucol_getVersion output buffer size changed incompatibly");
 
+	return true;
+}
+
+pg_icu_library *
+get_icu_library(const char *collate, const char *ctype, const char *version)
+{
+	pg_icu_library *lib = palloc0(sizeof(*lib));
+	bool filled = false;
+
+	if (get_icu_library_hook != NULL)
+		filled = get_icu_library_hook(lib, collate, ctype, version);
+
+	if(!filled)
+		filled = get_builtin_icu_library(lib);
+
+	Assert(filled);
 	return lib;
 }
+
 #endif
 
 /*
@@ -1607,7 +1649,7 @@ pg_newlocale(char provider, bool isdefault, bool deterministic,
 	{
 		UCollator  *collator;
 		UErrorCode	status;
-		pg_icu_library *iculib = get_builtin_icu_library();
+		pg_icu_library *iculib = get_icu_library(collate, ctype, version);
 
 		/* collator may be leaked if we encounter an error */
 
@@ -1926,7 +1968,7 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
-		pg_icu_library	*iculib = get_builtin_icu_library();
+		pg_icu_library	*iculib = get_icu_library(NULL, NULL, NULL);
 
 		status = U_ZERO_ERROR;
 		collator = iculib->openCollator(collcollate, &status);
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
index 54445d8b87..1be0205391 100644
--- a/src/include/utils/pg_locale_internal.h
+++ b/src/include/utils/pg_locale_internal.h
@@ -162,10 +162,18 @@ struct pg_locale_struct
 };
 
 #ifdef USE_ICU
+
+typedef bool (*get_icu_library_hook_type)(
+	pg_icu_library *lib, const char *collate, const char *ctype,
+	const char *version);
+
+extern PGDLLIMPORT get_icu_library_hook_type get_icu_library_hook;
+
 #define PG_ICU_LIB(x) ((x)->info.icu.lib)
 #define PG_ICU_COL(x) ((x)->info.icu.ucol)
 
-extern pg_icu_library *get_builtin_icu_library(void);
+extern pg_icu_library *get_icu_library(const char *collate, const char *ctype,
+									   const char *version);
 extern int32_t icu_to_uchar(pg_icu_library *lib, UChar **buff_uchar,
 							const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(pg_icu_library *lib, char **result,
-- 
2.34.1
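
The hook contract described in the v2-0005 patch comment (the hook returns true if it filled in the structure; otherwise Postgres fills it in from the ICU it was linked against) can be sketched in isolation. This is a minimal standalone illustration, not code from the patch: `demo_icu_library`, `demo_hook`, the `describe` member and the version string "v63" are hypothetical stand-ins, and `calloc` is used in place of `palloc0` so that no PostgreSQL headers are needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for pg_icu_library: a single illustrative function pointer. */
typedef struct demo_icu_library
{
	const char *(*describe) (void);
} demo_icu_library;

/* Same shape as get_icu_library_hook_type in the v2 patch. */
typedef bool (*demo_get_icu_library_hook_type) (demo_icu_library *lib,
												const char *collate,
												const char *ctype,
												const char *version);

static demo_get_icu_library_hook_type demo_get_icu_library_hook = NULL;

static const char *describe_builtin(void) { return "builtin"; }
static const char *describe_icu63(void) { return "icu63"; }

/* Fallback: fill the structure from the "linked-in" library. */
static bool
demo_get_builtin_icu_library(demo_icu_library *lib)
{
	lib->describe = describe_builtin;
	return true;
}

/*
 * Hypothetical hook: it only claims collations whose recorded version is
 * the made-up string "v63", and declines everything else.
 */
static bool
demo_hook(demo_icu_library *lib, const char *collate, const char *ctype,
		  const char *version)
{
	(void) collate;
	(void) ctype;
	if (version == NULL || strcmp(version, "v63") != 0)
		return false;
	lib->describe = describe_icu63;
	return true;
}

/* Mirrors the control flow of get_icu_library() in the v2 patch. */
static demo_icu_library *
demo_get_icu_library(const char *collate, const char *ctype,
					 const char *version)
{
	demo_icu_library *lib = calloc(1, sizeof(*lib));
	bool		filled = false;

	assert(lib != NULL);
	if (demo_get_icu_library_hook != NULL)
		filled = demo_get_icu_library_hook(lib, collate, ctype, version);
	if (!filled)
		filled = demo_get_builtin_icu_library(lib);
	assert(filled);
	return lib;
}
```

With no hook installed, every lookup resolves to the builtin entry. Once `demo_get_icu_library_hook = demo_hook` is set, only lookups carrying the matching version string are redirected, and callers that pass NULL arguments (as the catalog-independent call sites in the patch do) still get the builtin library.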

#137

Peter Eisentraut

peter.eisentraut@enterprisedb.com

about 3 years ago

In reply to: Thomas Munro (#135)

Re: Collation version tracking for macOS

On 05.12.22 22:33, Thomas Munro wrote:

On Tue, Dec 6, 2022 at 6:45 AM Joe Conway <mail@joeconway.com> wrote:

On 12/5/22 12:41, Jeff Davis wrote:

On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:

1. I think we should seriously consider provider = ICU63. I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work. Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while still
making it super clear what is going on.

As I understand it, this is #2 in your previous list?

Can we put the naming of the provider into the hands of the user, e.g.:

CREATE COLLATION PROVIDER icu63 TYPE icu
AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';

In this model, icu would be a "provider kind" and icu63 would be the
specific provider, which is named by the user.

That seems like the least magical approach, to me. We need an ICU
library; the administrator gives us one that looks like ICU; and we're
happy.

+1

I like this. The provider kind defines which path we take in our code,
and the specific library unambiguously defines a specific collation
behavior (I think, ignoring bugs?)

OK, I'm going to see what happens if I try to wrangle that stuff into
a new catalogue table.

I'm reviewing the commit fest entry
https://commitfest.postgresql.org/41/3956/, which points to this thread.
It appears that the above patch did not come about in time. The patch
of record is now Jeff's refactoring patch, which is also tracked in
another commit fest entry (https://commitfest.postgresql.org/41/4058/).
So as a matter of procedure, we should probably close this commit fest
entry for now. (Maybe we should also use a different thread subject in
the future.)

I have a few quick comments on the above syntax example:

There is currently a bunch of locale-using code that selects different
code paths by "collation provider", i.e., a libc-based code path and an
ICU-based code path (and sometimes also a default provider path). The
above proposal would shift the terminology and would probably require
some churn at those sites, in that they would now have to select by
"collation provider type". We could probably avoid that by shifting the
terms a bit, so instead of the suggested

provider type -> provider

we could use

provider -> version of that provider

(or some other actual term), which would leave the meaning of "provider"
unchanged as far as locale-using code is concerned. At least that's my
expectation, since no code for this has been seen yet. We should keep
this in mind in any case.

Also, the above example exposes a lot of operating system level details.
This creates issues with dump/restore, which some of the earlier
patches avoided by using a path-based approach, and it would also
require some thoughts about permissions. We probably want
non-superusers to be able to interact with this system somehow, for
upgrading (for some meaning of that action) indexes etc. without
superuser access. The more stuff from the OS we expose, the more stuff
we have to be able to lock down again in a usable manner.

(The search-by-collversion approach can probably avoid those issues better.)
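
The terminology point above (keep "provider" as the thing that selects the code path, and treat the specific library as a version of that provider) can be illustrated with a small sketch. It is only an illustration under assumptions: no code for this proposal exists in the thread, and `demo_locale`, `provider_version` and `demo_code_path` are hypothetical names. The provider letters 'i' and 'c' follow the existing COLLPROVIDER_ICU and COLLPROVIDER_LIBC values.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_COLLPROVIDER_ICU	'i'
#define DEMO_COLLPROVIDER_LIBC	'c'

typedef enum demo_path
{
	DEMO_PATH_LIBC,
	DEMO_PATH_ICU,
	DEMO_PATH_OTHER
} demo_path;

typedef struct demo_locale
{
	char		provider;			/* selects the code path, meaning unchanged */
	const char *provider_version;	/* hypothetical: which library build to
									 * use; NULL means the linked-in one */
} demo_locale;

/*
 * Locale-using code keeps switching on "provider" alone.  The extra
 * provider_version field is ignored here, so existing call sites would
 * not need to change.
 */
static demo_path
demo_code_path(const demo_locale *loc)
{
	assert(loc != NULL);
	switch (loc->provider)
	{
		case DEMO_COLLPROVIDER_ICU:
			return DEMO_PATH_ICU;
		case DEMO_COLLPROVIDER_LIBC:
			return DEMO_PATH_LIBC;
		default:
			return DEMO_PATH_OTHER;
	}
}
```

In this arrangement a locale pinned to a particular ICU build and one using the linked-in ICU take the same branch; only the code that opens the collator would need to look at the version field.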

#138

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#136)

4 attachment(s)

Re: Collation version tracking for macOS

Attached is a new set of patches, including a major enhancement: the
icu_multilib contrib module.

The pure refactoring patches have gone into this thread:

This patch series is dependent on those patches, and contains:

0001: Support multiple ICU libraries
0002: Add test
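
As background for what loading a non-builtin ICU library involves: ICU by default appends its major version to exported C symbol names (for example `ucol_open_63`), so code that loads symbols dynamically has to construct the suffixed name before looking it up. The helper below is a minimal sketch of that step only. `demo_versioned_symbol` is a hypothetical name, not a function from these patches, and it assumes the target library was built with ICU's default symbol renaming.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the versioned symbol name "<base>_<major>" into buf.
 * Returns false if the name does not fit, so the caller never passes a
 * truncated name on to a symbol lookup such as dlsym().
 */
static bool
demo_versioned_symbol(char *buf, size_t buflen, const char *base, int major)
{
	int			n;

	assert(buf != NULL && base != NULL);
	if (strlen(base) == 0 || buflen == 0)
		return false;

	n = snprintf(buf, buflen, "%s_%d", base, major);
	return n > 0 && (size_t) n < buflen;
}
```

A loader would call this once per member of the function-pointer structure and store each resolved address in the corresponding slot, which is the role the pg_icu_library structure plays in the 0001 patch.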

Attachments:

v7-0001-Support-multiple-ICU-collation-provider-libraries.patch (text/x-patch; charset=UTF-8)

From e8e439162f89eab7c656bdc11d7b23ee7f02d809 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 7 Dec 2022 11:07:31 -0800
Subject: [PATCH v7 1/4] Support multiple ICU collation provider libraries.

Introduce structure pg_icu_library, which holds pointers to each
required ICU method, and store this as part of pg_locale_t. Each call
to an ICU function instead goes through this structure, so that it can
more easily be replaced by a non-builtin ICU library.

Offer a hook to allow an extension to control which ICU library is
loaded for a given locale and version.

Author: Thomas Munro, Jeff Davis
---
 src/backend/commands/collationcmds.c   |  17 +-
 src/backend/utils/adt/formatting.c     |  67 ++++--
 src/backend/utils/adt/pg_locale.c      | 316 ++++++++++++++++++-------
 src/include/utils/pg_locale_internal.h | 124 +++++++++-
 4 files changed, 414 insertions(+), 110 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index d878be2fb8..217c13f066 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -566,13 +566,14 @@ get_icu_language_tag(const char *localename)
 {
 	char		buf[ULOC_FULLNAME_CAPACITY];
 	UErrorCode	status;
+	pg_icu_library *iculib = get_default_icu_library();
 
 	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
+	iculib->toLanguageTag(localename, buf, sizeof(buf), true, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
+						localename, iculib->errorName(status))));
 
 	return pstrdup(buf);
 }
@@ -591,11 +592,12 @@ get_icu_locale_comment(const char *localename)
 	int32		len_uchar;
 	int32		i;
 	char	   *result;
+	pg_icu_library *iculib = get_default_icu_library();
 
 	status = U_ZERO_ERROR;
-	len_uchar = uloc_getDisplayName(localename, "en",
-									displayname, lengthof(displayname),
-									&status);
+	len_uchar = iculib->getDisplayName(localename, "en",
+									   displayname, lengthof(displayname),
+									   &status);
 	if (U_FAILURE(status))
 		return NULL;			/* no good reason to raise an error */
 
@@ -921,12 +923,13 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 	{
 		int			i;
+		pg_icu_library *iculib = get_default_icu_library();
 
 		/*
 		 * Start the loop at -1 to sneak in the root locale without too much
 		 * code duplication.
 		 */
-		for (i = -1; i < uloc_countAvailable(); i++)
+		for (i = -1; i < iculib->countAvailable(); i++)
 		{
 			const char *name;
 			char	   *langtag;
@@ -937,7 +940,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			if (i == -1)
 				name = "";		/* ICU root locale */
 			else
-				name = uloc_getAvailable(i);
+				name = iculib->getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
 			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index f2fed8a2d8..40bfad075b 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1566,6 +1566,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1573,6 +1578,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 {
 	UErrorCode	status;
 	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
 	len_dest = len_source;		/* try first with same length */
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
@@ -1590,18 +1596,42 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->ctype, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->ctype, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1667,11 +1697,12 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			int32_t		len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToLower, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1789,11 +1820,12 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToUpper, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1912,11 +1944,12 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case_bi(iculib->strToTitle, mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index d3ca3c5011..10b179cb48 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -70,6 +70,8 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -110,6 +112,41 @@ char	   *localized_full_days[7 + 1];
 char	   *localized_abbrev_months[12 + 1];
 char	   *localized_full_months[12 + 1];
 
+/*
+ * The precise version of a collation provider library is important, because
+ * subtle changes in collation between library versions can corrupt
+ * indexes. This hook allows control over how collation provider libraries are
+ * loaded.
+ *
+ * If the hook is not set, or if it returns NULL, Postgres constructs the
+ * pg_icu_library structure from the symbols Postgres is linked with at build
+ * time.
+ *
+ * If the hook is set, it can instead construct the pg_icu_library structure
+ * using custom logic. Ordinarily, this custom logic would involve finding a
+ * specific known version of the collation provider library, and dynamically
+ * loading the necessary symbols. If the collation version obtained from the
+ * library does not match the collation version recorded in the catalog,
+ * Postgres will issue a WARNING.
+ *
+ * The returned structure must be allocated in TopMemoryContext, and the
+ * associated symbols must remain valid permanently. It's expected that the
+ * hook will maintain a small finite list of libraries and return a pointer to
+ * the appropriate one, so that resource use remains bounded.
+ *
+ * XXX: For now, the only information the hook has access to is the ICU locale
+ * name, and the collation version as recorded in the catalog. We should
+ * consider what other information can be provided to allow for greater
+ * control over which library is loaded.
+ */
+#ifdef USE_ICU
+get_icu_library_hook_type get_icu_library_hook = NULL;
+#endif
+
+#ifdef USE_ICU
+static pg_icu_library *builtin_icu_library = NULL;
+#endif
+
 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
 static bool CurrentLCTimeValid = false;
@@ -136,8 +173,12 @@ static char *IsoLocaleName(const char *);
  * Database default locale.
  */
 static pg_locale_t default_locale = NULL;
+#ifdef USE_ICU
+static pg_icu_library *default_icu = NULL;
+#endif
 
 #ifdef USE_ICU
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -145,13 +186,17 @@ static pg_locale_t default_locale = NULL;
  */
 static UConverter *icu_converter = NULL;
 
-static void init_icu_converter(void);
-static size_t uchar_length(UConverter *converter,
+static void init_icu_converter(pg_icu_library *iculib);
+static size_t uchar_length(pg_icu_library *iculib,
+						   UConverter *converter,
 						   const char *str, size_t len);
-static int32_t uchar_convert(UConverter *converter,
+static int32_t uchar_convert(pg_icu_library *iculib,
+							 UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, size_t srclen);
-static void icu_set_collation_attributes(UCollator *collator, const char *loc);
+static void icu_set_collation_attributes(pg_icu_library *iculib,
+										 UCollator *collator,
+										 const char *loc);
 #endif
 
 /*
@@ -1458,6 +1503,85 @@ report_newlocale_failure(const char *localename)
 }
 #endif							/* HAVE_LOCALE_T */
 
+#ifdef USE_ICU
+pg_icu_library *
+get_builtin_icu_library(void)
+{
+	UVersionInfo version_info;
+	pg_icu_library *lib;
+
+	if (builtin_icu_library != NULL)
+		return builtin_icu_library;
+
+	/*
+	 * These assignments will fail to compile if an incompatible API change is
+	 * made to some future version of ICU, at which point we might need to
+	 * consider special treatment for different major version ranges, with
+	 * intermediate trampoline functions.
+	 */
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(pg_icu_library));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = ucol_open;
+	lib->closeCollator = ucol_close;
+	lib->getCollatorVersion = ucol_getVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = ucol_strcoll;
+	lib->strcollUTF8 = ucol_strcollUTF8;
+	lib->getSortKey = ucol_getSortKey;
+	lib->nextSortKeyPart = ucol_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = u_strToUpper;
+	lib->strToLower = u_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = ucol_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	lib->getICUVersion(version_info);
+	lib->major_version = version_info[0];
+	lib->minor_version = version_info[1];
+	lib->libicui18n_name = "<builtin>";
+	lib->libicuuc_name = "<builtin>";
+
+	/*
+	 * Also assert the size of a couple of types used as output buffers, as a
+	 * canary to tell us to add extra padding in the (unlikely) event that a
+	 * later release makes these values smaller.
+	 */
+	StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+					 "u_versionToString output buffer size changed incompatibly");
+	StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+					 "ucol_getVersion output buffer size changed incompatibly");
+
+	builtin_icu_library = lib;
+	return lib;
+}
+
+static pg_icu_library *
+get_icu_library(Oid collid, const char *locale, const char *version)
+{
+	pg_icu_library *lib = NULL;
+
+	if (get_icu_library_hook != NULL)
+		lib = get_icu_library_hook(collid, locale, version);
+
+	if (!lib)
+		lib = get_builtin_icu_library();
+
+	return lib;
+}
+
+#endif
 
 /*
  * Construct a new pg_locale_t object.
@@ -1477,8 +1601,8 @@ report_newlocale_failure(const char *localename)
  * effectively permanent.
  */
 static pg_locale_t
-pg_newlocale(char provider, bool deterministic, const char *collate,
-			 const char *ctype, const char *version)
+pg_newlocale(Oid collid, char provider, bool deterministic,
+			 const char *collate, const char *ctype, const char *version)
 {
 	pg_locale_t result = MemoryContextAlloc(TopMemoryContext,
 											sizeof(struct pg_locale_struct));
@@ -1554,20 +1678,24 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 	{
 		UCollator  *collator;
 		UErrorCode	status;
+		pg_icu_library *iculib = get_icu_library(collid, collate, version);
+
+		Assert(strcmp(collate, ctype) == 0);
 
 		/* collator may be leaked if we encounter an error */
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collate, &status);
+		collator = iculib->openCollator(collate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collate, u_errorName(status))));
+							collate, iculib->errorName(status))));
 
 		if (U_ICU_VERSION_MAJOR_NUM < 54)
-			icu_set_collation_attributes(collator, collate);
+			icu_set_collation_attributes(iculib, collator, collate);
 
 		result->info.icu.ucol = collator;
+		result->info.icu.lib = iculib;
 	}
 #endif
 	else
@@ -1589,6 +1717,14 @@ pg_locale_deterministic(pg_locale_t locale)
 		return locale->deterministic;
 }
 
+#ifdef USE_ICU
+pg_icu_library *
+get_default_icu_library(void)
+{
+	return default_icu;
+}
+#endif
+
 /*
  * Initialize default database locale.
  */
@@ -1596,6 +1732,10 @@ void
 init_default_locale(char provider, const char *collate, const char *ctype,
 					const char *iculocale, const char *version)
 {
+#ifdef USE_ICU
+	default_icu = get_icu_library(DEFAULT_COLLATION_OID, iculocale, version);
+#endif
+
 	/*
 	 * For the purposes of pg_locale_t, if the provider is ICU, we use
 	 * iculocale for both collate and ctype.
@@ -1613,7 +1753,8 @@ init_default_locale(char provider, const char *collate, const char *ctype,
 	 * locales currently don't support pattern matching, which would break a
 	 * lot of things if applied globally.
 	 */
-	default_locale = pg_newlocale(provider, true, collate, ctype, version);
+	default_locale = pg_newlocale(DEFAULT_COLLATION_OID, provider, true,
+								  collate, ctype, version);
 }
 
 /*
@@ -1699,7 +1840,7 @@ pg_newlocale_from_collation(Oid collid)
 			/* shouldn't happen */
 			elog(ERROR, "unsupported collprovider: %c", collform->collprovider);
 
-		locale = pg_newlocale(collform->collprovider,
+		locale = pg_newlocale(collid, collform->collprovider,
 							  collform->collisdeterministic,
 							  collate, ctype, collversionstr);
 
@@ -1759,17 +1900,19 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
+		pg_icu_library	*iculib = get_icu_library(InvalidOid, collcollate,
+												  NULL);
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = iculib->openCollator(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, iculib->errorName(status))));
+		iculib->getCollatorVersion(collator, versioninfo);
+		iculib->closeCollator(collator);
 
-		u_versionToString(versioninfo, buf);
+		iculib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -2021,16 +2164,17 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	UChar	*uchar1,
 			*uchar2;
 	int		 result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 #ifdef HAVE_UCOL_STRCOLLUTF8
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 #endif
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen1 = uchar_length(icu_converter, arg1, len1);
-	ulen2 = uchar_length(icu_converter, arg2, len2);
+	ulen1 = uchar_length(iculib, icu_converter, arg1, len1);
+	ulen2 = uchar_length(iculib, icu_converter, arg2, len2);
 
 	bufsize1 = (ulen1 + 1) * sizeof(UChar);
 	bufsize2 = (ulen2 + 1) * sizeof(UChar);
@@ -2041,12 +2185,12 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	uchar1 = (UChar *) buf;
 	uchar2 = (UChar *) (buf + bufsize1);
 
-	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
-	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+	ulen1 = uchar_convert(iculib, icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(iculib, icu_converter, uchar2, ulen2 + 1, arg2, len2);
 
-	result = ucol_strcoll(locale->info.icu.ucol,
-						  uchar1, ulen1,
-						  uchar2, ulen2);
+	result = iculib->strcoll(locale->info.icu.ucol,
+							 uchar1, ulen1,
+							 uchar2, ulen2);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2067,6 +2211,7 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 				pg_locale_t locale)
 {
 	int result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
@@ -2076,13 +2221,14 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 		UErrorCode	status;
 
 		status = U_ZERO_ERROR;
-		result = ucol_strcollUTF8(locale->info.icu.ucol,
-								  arg1, len1,
-								  arg2, len2,
-								  &status);
+		result = iculib->strcollUTF8(locale->info.icu.ucol,
+									 arg1, len1,
+									 arg2, len2,
+									 &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
-					(errmsg("collation failed: %s", u_errorName(status))));
+					(errmsg("collation failed: %s",
+							iculib->errorName(status))));
 	}
 	else
 #endif
@@ -2242,12 +2388,13 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 	int32_t	 ulen;
 	size_t   uchar_bsize;
 	Size	 result_bsize;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2256,11 +2403,11 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
-	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
-								   uchar, ulen,
-								   (uint8_t *) dest, destsize);
+	result_bsize = iculib->getSortKey(locale->info.icu.ucol,
+									  uchar, ulen,
+									  (uint8_t *) dest, destsize);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2289,13 +2436,14 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 	UChar			*uchar = NULL;
 	size_t			 uchar_bsize;
 	Size			 result_bsize;
+	pg_icu_library	*iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2304,21 +2452,19 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
 	uiter_setString(&iter, uchar, ulen);
 	state[0] = state[1] = 0;	/* won't need that again */
 	status = U_ZERO_ERROR;
-	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
-										&iter,
-										state,
-										(uint8_t *) dest,
-										destsize,
-										&status);
+	result_bsize = iculib->nextSortKeyPart(
+		locale->info.icu.ucol, &iter, state,
+		(uint8_t *) dest, destsize, &status);
+
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("sort key generation failed: %s",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return result_bsize;
 }
@@ -2327,6 +2473,7 @@ static size_t
 pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 					   size_t destsize, pg_locale_t locale)
 {
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 	size_t result;
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
@@ -2337,19 +2484,17 @@ pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 		uint32_t	state[2];
 		UErrorCode	status;
 
-		uiter_setUTF8(&iter, src, srclen);
+		iculib->setUTF8(&iter, src, srclen);
 		state[0] = state[1] = 0;	/* won't need that again */
 		status = U_ZERO_ERROR;
-		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
-									  &iter,
-									  state,
-									  (uint8_t *) dest,
-									  destsize,
-									  &status);
+		result = iculib->nextSortKeyPart(
+			locale->info.icu.ucol, &iter, state,
+			(uint8_t *) dest, destsize, &status);
+
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("sort key generation failed: %s",
-							u_errorName(status))));
+							iculib->errorName(status))));
 	}
 	else
 		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
@@ -2546,7 +2691,7 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 
 #ifdef USE_ICU
 static void
-init_icu_converter(void)
+init_icu_converter(pg_icu_library *iculib)
 {
 	const char *icu_encoding_name;
 	UErrorCode	status;
@@ -2563,11 +2708,11 @@ init_icu_converter(void)
 						pg_encoding_to_char(GetDatabaseEncoding()))));
 
 	status = U_ZERO_ERROR;
-	conv = ucnv_open(icu_encoding_name, &status);
+	conv = iculib->openConverter(icu_encoding_name, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open ICU converter for encoding \"%s\": %s",
-						icu_encoding_name, u_errorName(status))));
+						icu_encoding_name, iculib->errorName(status))));
 
 	icu_converter = conv;
 }
@@ -2576,14 +2721,15 @@ init_icu_converter(void)
  * Find length, in UChars, of given string if converted to UChar string.
  */
 static size_t
-uchar_length(UConverter *converter, const char *str, size_t len)
+uchar_length(pg_icu_library *iculib, UConverter *converter, const char *str, size_t len)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
-	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	ulen = iculib->toUChars(converter, NULL, 0, str, len, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2592,16 +2738,17 @@ uchar_length(UConverter *converter, const char *str, size_t len)
  * return the length (in UChars).
  */
 static int32_t
-uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
-			  const char *src, size_t srclen)
+uchar_convert(pg_icu_library *iculib, UConverter *converter, UChar *dest,
+			  int32_t destlen, const char *src, size_t srclen)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
 	status = U_ZERO_ERROR;
-	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	ulen = iculib->toUChars(converter, dest, destlen, src, srclen, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2618,16 +2765,17 @@ uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
  * result length instead.
  */
 int32_t
-icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
+icu_to_uchar(pg_icu_library *iculib, UChar **buff_uchar, const char *buff,
+			 size_t nbytes)
 {
 	int32_t len_uchar;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	len_uchar = uchar_length(icu_converter, buff, nbytes);
+	len_uchar = uchar_length(iculib, icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-	len_uchar = uchar_convert(icu_converter,
+	len_uchar = uchar_convert(iculib, icu_converter,
 							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
@@ -2645,30 +2793,32 @@ icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
  * The result string is nul-terminated.
  */
 int32_t
-icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
+icu_from_uchar(pg_icu_library *iculib, char **result, const UChar *buff_uchar,
+			   int32_t len_uchar)
 {
 	UErrorCode	status;
 	int32_t		len_result;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, NULL, 0,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, NULL, 0,
+									buff_uchar, len_uchar, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	*result = palloc(len_result + 1);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, *result, len_result + 1,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, *result,
+									len_result + 1, buff_uchar,
+									len_uchar, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return len_result;
 }
@@ -2684,7 +2834,8 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  */
 pg_attribute_unused()
 static void
-icu_set_collation_attributes(UCollator *collator, const char *loc)
+icu_set_collation_attributes(pg_icu_library *iculib, UCollator *collator,
+							 const char *loc)
 {
 	char	   *str = asc_tolower(loc, strlen(loc));
 
@@ -2758,7 +2909,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 				status = U_ILLEGAL_ARGUMENT_ERROR;
 
 			if (status == U_ZERO_ERROR)
-				ucol_setAttribute(collator, uattr, uvalue, &status);
+				iculib->setAttribute(collator, uattr, uvalue, &status);
 
 			/*
 			 * Pretend the error came from ucol_open(), for consistent error
@@ -2767,7 +2918,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			if (U_FAILURE(status))
 				ereport(ERROR,
 						(errmsg("could not open collator for locale \"%s\": %s",
-								loc, u_errorName(status))));
+								loc, iculib->errorName(status))));
 		}
 	}
 }
@@ -2783,17 +2934,18 @@ check_icu_locale(const char *icu_locale)
 #ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	pg_icu_library *iculib = get_icu_library(InvalidOid, icu_locale, NULL);
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = iculib->openCollator(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, iculib->errorName(status))));
 
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+		icu_set_collation_attributes(iculib, collator, icu_locale);
+	iculib->closeCollator(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
index 33465ad92d..b600604585 100644
--- a/src/include/utils/pg_locale_internal.h
+++ b/src/include/utils/pg_locale_internal.h
@@ -14,6 +14,8 @@
 #define _PG_LOCALE_INTERNAL_
 
 #ifdef USE_ICU
+#include <unicode/ubrk.h>
+#include <unicode/ucnv.h>
 #include <unicode/ucol.h>
 #endif
 
@@ -30,6 +32,108 @@
 #endif
 #endif
 
+#ifdef USE_ICU
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	int			minor_version;
+	char	   *libicui18n_name;
+	char	   *libicuuc_name;
+	void		(*getICUVersion) (UVersionInfo info);
+	void		(*getUnicodeVersion) (UVersionInfo info);
+	void		(*getCLDRVersion) (UVersionInfo info, UErrorCode *status);
+	UCollator  *(*openCollator) (const char *loc, UErrorCode *status);
+	void		(*closeCollator) (UCollator *coll);
+	void		(*getCollatorVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*getUCAVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+	UCollationResult (*strcoll) (const UCollator *coll,
+								 const UChar *source,
+								 int32_t sourceLength,
+								 const UChar *target,
+								 int32_t targetLength);
+	UCollationResult (*strcollUTF8) (const UCollator *coll,
+									 const char *source,
+									 int32_t sourceLength,
+									 const char *target,
+									 int32_t targetLength,
+									 UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	void		(*setAttribute) (UCollator *coll,
+								 UColAttribute attr,
+								 UColAttributeValue value,
+								 UErrorCode *status);
+	UConverter *(*openConverter) (const char *converterName,
+								  UErrorCode *err);
+	void		(*closeConverter) (UConverter *converter);
+	int32_t		(*fromUChars) (UConverter *cnv,
+							   char *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UErrorCode *pErrorCode);
+	int32_t		(*toUChars) (UConverter *cnv,
+							 UChar *dest,
+							 int32_t destCapacity,
+							 const char *src,
+							 int32_t srcLength,
+							 UErrorCode *pErrorCode);
+	int32_t		(*toLanguageTag) (const char *localeID,
+								  char *langtag,
+								  int32_t langtagCapacity,
+								  UBool strict,
+								  UErrorCode *err);
+	int32_t		(*getDisplayName) (const char *localeID,
+								   const char *inLocaleID,
+								   UChar *result,
+								   int32_t maxResultSize,
+								   UErrorCode *err);
+	int32_t		(*countAvailable) (void);
+	const char *(*getAvailable) (int32_t n);
+} pg_icu_library;
+
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+
+#endif
+
 /*
  * We define our own wrapper around locale_t so we can keep the same
  * function signatures for all builds, while not having to create a
@@ -53,7 +157,8 @@ struct pg_locale_struct
 #ifdef USE_ICU
 		struct
 		{
-			UCollator	*ucol;
+			UCollator		*ucol;
+			pg_icu_library	*lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
@@ -61,8 +166,19 @@ struct pg_locale_struct
 };
 
 #ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
+
+typedef pg_icu_library *(*get_icu_library_hook_type)(
+	Oid collid, const char *locale, const char *version);
+
+extern PGDLLIMPORT get_icu_library_hook_type get_icu_library_hook;
+
+extern pg_icu_library *get_default_icu_library(void);
+extern pg_icu_library *get_builtin_icu_library(void);
+extern int32_t icu_to_uchar(pg_icu_library *lib, UChar **buff_uchar,
+							const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(pg_icu_library *lib, char **result,
+							  const UChar *buff_uchar, int32_t len_uchar);
+
+#endif							/* USE_ICU */
 
 #endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1
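
For readers skimming the patch above: the core idea of pg_icu_library is a per-library table of function pointers that callers go through instead of calling ICU symbols directly, combined with a hook that lets an extension supply the table. Below is a minimal, ICU-free sketch of that pattern in plain C. All demo_* names are hypothetical and not part of the patch. The fallback to a built-in table when the hook is unset or returns NULL is an assumption, since that part of the patch is not shown in this excerpt.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, much-reduced analogue of pg_icu_library. */
typedef struct demo_collator_lib
{
	int			major_version;
	int			(*compare) (const char *a, const char *b);
} demo_collator_lib;

/* Hook type: an extension may return its own table for a given locale. */
typedef demo_collator_lib *(*demo_get_library_hook_type) (const char *locale);

demo_get_library_hook_type demo_get_library_hook = NULL;

/* Normalize strcmp()'s result to -1, 0 or 1. */
static int
builtin_compare(const char *a, const char *b)
{
	int			r = strcmp(a, b);

	return (r > 0) - (r < 0);
}

/* Same comparison with the sign flipped. */
static int
reverse_compare(const char *a, const char *b)
{
	return -builtin_compare(a, b);
}

static demo_collator_lib builtin_lib = {1, builtin_compare};
static demo_collator_lib reverse_lib = {2, reverse_compare};

/* Example hook: only claims the locale named "DESC" (assumption). */
demo_collator_lib *
demo_reverse_hook(const char *locale)
{
	if (strcmp(locale, "DESC") == 0)
		return &reverse_lib;
	return NULL;
}

/* Ask the hook first; fall back to the built-in table (assumed behavior). */
demo_collator_lib *
demo_get_library(const char *locale)
{
	if (demo_get_library_hook != NULL)
	{
		demo_collator_lib *lib = demo_get_library_hook(locale);

		if (lib != NULL)
			return lib;
	}
	return &builtin_lib;
}

/* Callers never name a concrete library; they dispatch through the table. */
int
demo_strcoll(const char *locale, const char *a, const char *b)
{
	demo_collator_lib *lib = demo_get_library(locale);

	return lib->compare(a, b);
}
```

With the hook unset, demo_strcoll("DESC", "aBcD", "wXyZ") uses the built-in comparison. After assigning demo_get_library_hook = demo_reverse_hook, the same call dispatches to the reversed table, while other locales are unaffected.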

v7-0002-Add-test-module-for-icu-collation-provider-hook.patch (text/x-patch; charset=UTF-8)

From 02ee737d4ef9ddcea7c1469fc94320273d448878 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 21 Dec 2022 19:50:10 -0800
Subject: [PATCH v7 2/4] Add test module for icu collation provider hook.

---
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 .../modules/test_collator_lib_hooks/Makefile  |  24 ++
 .../test_collator_lib_hooks/meson.build       |  37 +++
 .../test_collator_lib_hooks/t/001_icu.pl      | 153 ++++++++++++
 .../test_collator_lib_hooks.c                 |  43 ++++
 .../test_collator_lib_hooks.control           |   4 +
 .../test_collator_lib_hooks.h                 |  28 +++
 .../test_collator_lib_hooks/test_icu_hook.c   | 228 ++++++++++++++++++
 9 files changed, 519 insertions(+)
 create mode 100644 src/test/modules/test_collator_lib_hooks/Makefile
 create mode 100644 src/test/modules/test_collator_lib_hooks/meson.build
 create mode 100644 src/test/modules/test_collator_lib_hooks/t/001_icu.pl
 create mode 100644 src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.c
 create mode 100644 src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.control
 create mode 100644 src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.h
 create mode 100644 src/test/modules/test_collator_lib_hooks/test_icu_hook.c

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..8a90b206a1 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -15,6 +15,7 @@ SUBDIRS = \
 		  snapshot_too_old \
 		  spgist_name_ops \
 		  test_bloomfilter \
+		  test_collator_lib_hooks \
 		  test_copy_callbacks \
 		  test_custom_rmgrs \
 		  test_ddl_deparse \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..a59cc9a89f 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -11,6 +11,7 @@ subdir('snapshot_too_old')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
+subdir('test_collator_lib_hooks')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
 subdir('test_ddl_deparse')
diff --git a/src/test/modules/test_collator_lib_hooks/Makefile b/src/test/modules/test_collator_lib_hooks/Makefile
new file mode 100644
index 0000000000..b44c8f5914
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/Makefile
@@ -0,0 +1,24 @@
+# src/test/modules/test_collator_lib_hooks/Makefile
+
+MODULE_big = test_collator_lib_hooks
+OBJS = \
+	$(WIN32RES) \
+	test_collator_lib_hooks.o test_icu_hook.o
+PGFILEDESC = "test_collator_lib_hooks - test collation provider library hooks"
+
+EXTENSION = test_collator_lib_hooks
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_collator_lib_hooks
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+export with_icu
diff --git a/src/test/modules/test_collator_lib_hooks/meson.build b/src/test/modules/test_collator_lib_hooks/meson.build
new file mode 100644
index 0000000000..0190a856e0
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/meson.build
@@ -0,0 +1,37 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_collator_lib_hooks_sources = files(
+  'test_collator_lib_hooks.c',
+  'test_icu_hook.c',
+)
+
+if host_system == 'windows'
+  test_collator_lib_hooks_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_collator_lib_hooks',
+    '--FILEDESC', 'test_collator_lib_hooks - test collation provider library hooks',])
+endif
+
+test_collator_lib_hooks = shared_module('test_collator_lib_hooks',
+  test_collator_lib_hooks_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_collator_lib_hooks
+
+install_data(
+  'test_collator_lib_hooks.control',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_collator_lib_hooks',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_icu.pl',
+    ],
+    'env': {
+      'with_icu': icu.found() ? 'yes' : 'no',
+    },
+  },
+}
diff --git a/src/test/modules/test_collator_lib_hooks/t/001_icu.pl b/src/test/modules/test_collator_lib_hooks/t/001_icu.pl
new file mode 100644
index 0000000000..d3b5729527
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/t/001_icu.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{with_icu} ne 'yes')
+{
+	plan skip_all => 'ICU not supported by this build';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collator_lib_hooks'
+});
+$node->start;
+
+$node->safe_psql('postgres',
+	q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' ICU_LOCALE 'DESC' ENCODING 'UTF8' TEMPLATE template0}
+);
+
+# setup
+$node->safe_psql('dbicu',
+	qq[CREATE COLLATION test_asc (PROVIDER=icu, LOCALE='ASC')]);
+$node->safe_psql('dbicu',
+	qq[CREATE COLLATION test_desc (PROVIDER=icu, LOCALE='DESC')]);
+
+$node->safe_psql('dbicu', qq[CREATE TABLE strings(t text)]);
+$node->safe_psql('dbicu',
+	qq[INSERT INTO strings VALUES ('aBcD'), ('fGhI'), ('wXyZ')]);
+
+# check versions
+
+my $version_db =
+  $node->safe_psql('dbicu',
+	  qq[SELECT datcollversion FROM pg_database WHERE datname='dbicu']);
+is($version_db, '2.72',
+	'database "dbicu" has correct version 2.72'
+);
+
+my $version_asc =
+  $node->safe_psql('dbicu',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_asc']);
+is($version_asc, '2.72',
+	'collation "test_asc" has correct version 2.72'
+);
+
+my $version_desc =
+  $node->safe_psql('dbicu',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_desc']);
+is($version_desc, '2.72',
+	'collation "test_desc" has correct version 2.72'
+);
+
+my $res_sort_expected = "aBcD
+fGhI
+wXyZ";
+
+my $res_reversesort_expected = "wXyZ
+fGhI
+aBcD";
+
+# test comparison
+
+my $comparison =
+  $node->safe_psql('dbicu',
+	  qq[SELECT 'aBcD' COLLATE test_asc < 'wXyZ' COLLATE test_asc]);
+is($comparison, 't',
+	'correct comparison'
+);
+
+# test reverse comparison (database)
+
+my $dbcomparison_reverse =
+  $node->safe_psql('dbicu', qq[SELECT 'aBcD' < 'wXyZ']);
+is($dbcomparison_reverse, 'f',
+	'correct reverse comparison (database)'
+);
+
+# test reverse comparison
+
+my $comparison_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT 'aBcD' COLLATE test_desc < 'wXyZ' COLLATE test_desc]);
+is($comparison_reverse, 'f',
+	'correct reverse comparison'
+);
+
+# test asc sort
+
+my $res_sort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t COLLATE test_asc]);
+is($res_sort, $res_sort_expected,
+	'correct ascending sort'
+);
+
+# test desc sort (database)
+
+my $res_db_reversesort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t]);
+is($res_db_reversesort, $res_reversesort_expected,
+	'correct descending sort (database)'
+);
+
+# test desc sort
+
+my $res_reversesort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t COLLATE test_desc]);
+is($res_reversesort, $res_reversesort_expected,
+	'correct descending sort'
+);
+
+# test lower/upper
+
+my $tcase =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_asc),
+                upper('aBcDfgHiwXyZ' collate test_asc)]);
+is($tcase, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+	'correct lowercase and uppercase'
+);
+
+# test reverse lower/upper (database)
+
+my $tcase_db_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ'),
+                upper('aBcDfgHiwXyZ')]);
+is($tcase_db_reverse, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct reverse lowercase and uppercase (database)'
+);
+
+# test reverse lower/upper
+
+my $tcase_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_desc),
+                upper('aBcDfgHiwXyZ' collate test_desc)]);
+is($tcase_reverse, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct reverse lowercase and uppercase'
+);
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.c b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.c
new file mode 100644
index 0000000000..45a849b52d
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.c
@@ -0,0 +1,43 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_collator_lib_hooks.c
+ *		Code for testing collation provider library hooks
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.c
+ *
+ * Test implementation of icu-like collation provider.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "miscadmin.h"
+
+#include "test_collator_lib_hooks.h"
+
+#ifdef USE_ICU
+static get_icu_library_hook_type prev_get_icu_library_hook = NULL;
+#endif
+
+PG_MODULE_MAGIC;
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	if (!process_shared_preload_libraries_in_progress)
+		ereport(ERROR, (errmsg("test_collator_lib_hooks must be loaded via shared_preload_libraries")));
+
+#ifdef USE_ICU
+	prev_get_icu_library_hook = get_icu_library_hook;
+	get_icu_library_hook = test_get_icu_library;
+#endif
+}
diff --git a/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.control b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.control
new file mode 100644
index 0000000000..a0b8e031a4
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.control
@@ -0,0 +1,4 @@
+comment = 'Test code for collation provider library hooks'
+default_version = '1.0'
+module_pathname = '$libdir/test_collator_lib_hooks'
+
diff --git a/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.h b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.h
new file mode 100644
index 0000000000..c3a9ff45e2
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.h
@@ -0,0 +1,28 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_collator_lib_hooks.h
+ *		Definitions for collation library hooks.
+ *
+ * Copyright (c) 2015-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#ifndef TEST_COLLATOR_LIB_HOOKS_H
+#define TEST_COLLATOR_LIB_HOOKS_H
+
+#include "postgres.h"
+
+#include "utils/memutils.h"
+#include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
+
+#ifdef USE_ICU
+extern pg_icu_library *test_get_icu_library(Oid collid, const char *locale,
+											const char *version);
+#endif
+
+#endif
diff --git a/src/test/modules/test_collator_lib_hooks/test_icu_hook.c b/src/test/modules/test_collator_lib_hooks/test_icu_hook.c
new file mode 100644
index 0000000000..423ff30b5b
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/test_icu_hook.c
@@ -0,0 +1,228 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_icu_hook.c
+ *		Code for testing collation provider icu hook.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collator_lib_hooks/test_icu_hook.c
+ *
+ * Implements a custom icu-like collation provider library for testing the
+ * hooks. It accepts any collator name requested. All behave exactly like the
+ * "en_US" locale, except for the locale named "DESC", which reverses the sort
+ * order and reverses uppercase/lowercase behavior.
+ *
+ * The version is always reported as 2.72, so loading it will cause a version
+ * mismatch warning.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "test_collator_lib_hooks.h"
+
+#ifdef USE_ICU
+
+#include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
+
+#define TEST_LOCALE "en_US"
+
+typedef struct TestUCollator {
+	UCollator	*ucol;
+	bool		 reverse;
+} TestUCollator;
+
+static pg_icu_library *test_icu_library = NULL;
+static const UVersionInfo test_icu_version = { 2, 72 };
+
+static bool
+locale_is_reverse(const char *locale)
+{
+	if (strcmp(locale, "DESC") == 0)
+		return true;
+	else
+		return false;
+}
+
+static UCollator *
+test_openCollator(const char *loc, UErrorCode *status)
+{
+	TestUCollator *testcol = MemoryContextAlloc(TopMemoryContext, sizeof(TestUCollator));
+	UCollator *ucol = ucol_open(TEST_LOCALE, status);
+	testcol->ucol = ucol;
+	testcol->reverse = locale_is_reverse(loc);
+	return (UCollator *)testcol;
+}
+
+static void
+test_closeCollator(UCollator *coll)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	ucol_close(testcol->ucol);
+	pfree(testcol);
+}
+
+static void
+test_setAttribute(UCollator *coll, UColAttribute attr,
+				  UColAttributeValue value, UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	ucol_setAttribute(testcol->ucol, attr, value, status);
+}
+
+static void
+test_getCollatorVersion(const UCollator *coll, UVersionInfo info)
+{
+	memcpy(info, test_icu_version, sizeof(UVersionInfo));
+}
+
+static UCollationResult
+test_strcoll(const UCollator *coll, const UChar *source, int32_t sourceLength,
+			 const UChar *target, int32_t targetLength)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	UCollationResult ret = ucol_strcoll(testcol->ucol, source, sourceLength,
+										target, targetLength);
+	if (testcol->reverse)
+		return -ret;
+	else
+		return ret;
+}
+
+static UCollationResult
+test_strcollUTF8(const UCollator *coll, const char *source,
+				 int32_t sourceLength, const char *target,
+				 int32_t targetLength, UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	UCollationResult ret = ucol_strcollUTF8(testcol->ucol, source,
+											sourceLength, target,
+											targetLength, status);
+	if (testcol->reverse)
+		return -ret;
+	else
+		return ret;
+}
+
+static int32_t
+test_getSortKey(const UCollator *coll, const UChar *source,
+				int32_t sourceLength, uint8_t *result, int32_t resultLength)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	int32_t ret = ucol_getSortKey(testcol->ucol, source, sourceLength,
+								  result, resultLength);
+	size_t result_size = ret + 1;
+
+	if (resultLength >= result_size)
+	{
+		result[ret] = '\0';
+
+		if (testcol->reverse)
+			for (int i = 0; i < result_size; i++)
+				*((unsigned char *) result + i) ^= (unsigned char) 0xff;
+	}
+
+	return result_size;
+}
+
+static int32_t
+test_nextSortKeyPart(const UCollator *coll, UCharIterator *iter,
+					 uint32_t state[2], uint8_t *dest, int32_t count,
+					 UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	int32_t ret = ucol_nextSortKeyPart(testcol->ucol, iter, state, dest,
+									   count, status);
+
+	if (testcol->reverse)
+		for (int i = 0; i < ret; i++)
+			*((unsigned char *) dest + i) ^= (unsigned char) 0xff;
+
+	/*
+	 * The following is not correct for cases where we finish precisely on the
+	 * boundary (i.e. count is exactly enough). To fix this we'd need to track
+	 * additional state across calls, which doesn't seem worth it for a test
+	 * case.
+	 */
+	if (count >= ret && ret > 0)
+	{
+		if (testcol->reverse)
+			dest[ret] = 0xff;
+		else
+			dest[ret] = '\0';
+		return ret + 1;
+	}
+
+	return ret;
+}
+
+static int32_t
+test_strToUpper(UChar *dest, int32_t destCapacity, const UChar *src,
+				int32_t srcLength, const char *locale, UErrorCode *pErrorCode)
+{
+	if (locale_is_reverse(locale))
+		return u_strToLower(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+	else
+		return u_strToUpper(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+}
+
+static int32_t
+test_strToLower(UChar *dest, int32_t destCapacity, const UChar *src,
+				int32_t srcLength, const char *locale, UErrorCode *pErrorCode)
+{
+	if (locale_is_reverse(locale))
+		return u_strToUpper(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+	else
+		return u_strToLower(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+}
+
+pg_icu_library *
+test_get_icu_library(Oid collid, const char *locale, const char *version)
+{
+	pg_icu_library *lib;
+
+	if (test_icu_library != NULL)
+		return test_icu_library;
+
+	ereport(LOG, (errmsg("loading custom ICU provider for test_collator_lib_hooks")));
+
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(pg_icu_library));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = test_openCollator;
+	lib->closeCollator = test_closeCollator;
+	lib->getCollatorVersion = test_getCollatorVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = test_strcoll;
+	lib->strcollUTF8 = test_strcollUTF8;
+	lib->getSortKey = test_getSortKey;
+	lib->nextSortKeyPart = test_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = test_strToUpper;
+	lib->strToLower = test_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = test_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	test_icu_library = lib;
+	return lib;
+}
+
+#endif				/* USE_ICU */
-- 
2.34.1
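
The test module above reverses the "DESC" locale's sort order by XOR-ing every sort key byte with 0xff. A minimal standalone sketch of why that works, assuming keys are compared bytewise as with memcmp(); the demo_* names are hypothetical and not part of the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Complement every byte of a binary sort key in place. */
void
demo_invert_sort_key(uint8_t *key, size_t len)
{
	for (size_t i = 0; i < len; i++)
		key[i] ^= 0xff;
}

/* Compare two keys bytewise over len bytes; returns -1, 0 or 1. */
int
demo_compare_keys(const uint8_t *a, const uint8_t *b, size_t len)
{
	int			r = memcmp(a, b, len);

	return (r > 0) - (r < 0);
}
```

Complementing a byte maps x to 255 - x, so the first differing byte of two keys compares the opposite way after inversion. The terminating 0x00 becomes 0xff, which means a key that is a proper prefix of another sorts after it rather than before it. That is presumably why the test code takes care to include a terminator byte in the inverted range.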

v7-0003-Add-initdb-option-no-import-collations.patch (text/x-patch; charset=UTF-8)

From bb987434b9aed004998ac0fd11e82ecc0923a479 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 18 Jan 2023 12:32:56 -0800
Subject: [PATCH v7 3/4] Add initdb option --no-import-collations.

In preparation for the icu_multilib contrib module, which allows
greater control over how ICU collations and different ICU library
versions are managed and used.

By not creating the collations during initdb time, the user will be
able to run pg_import_system_collations() themselves using a specific
version of ICU.
---
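Reader's note (not part of the patch): the new switch is a plain boolean
long option. The sketch below shows that pattern in isolation, assuming a
glibc-style getopt_long() that can be restarted by resetting optind. The
helper name parse_import_flag and the one-entry option table are
hypothetical; only the option name and the value 17 come from the patch.

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if system collations should be
 * imported, i.e. --no-import-collations was NOT given.
 */
static bool
parse_import_flag(int argc, char *argv[])
{
	static const struct option long_options[] = {
		{"no-import-collations", no_argument, NULL, 17},
		{NULL, 0, NULL, 0}
	};
	bool		import_system_collations = true;
	int			c;

	assert(argv != NULL);

	optind = 1;					/* restart scanning (glibc-style assumption) */
	opterr = 0;					/* keep the sketch quiet on unknown options */

	while ((c = getopt_long(argc, argv, "", long_options, NULL)) != -1)
	{
		if (c == 17)
			import_system_collations = false;
	}

	return import_system_collations;
}
```

With the flag given, the caller would skip the
pg_import_system_collations('pg_catalog') call during bootstrap and leave
the import to the user.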
 doc/src/sgml/ref/initdb.sgml | 11 +++++++++++
 src/bin/initdb/initdb.c      | 10 +++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 5b2bdac101..5a2da829fb 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -305,6 +305,17 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-import-collations">
+      <term><option>--no-import-collations</option></term>
+      <listitem>
+       <para>
+        Prevents <command>initdb</command> from importing collations from the
+        system. This can be used to more carefully control which collations
+        are initially loaded in each database.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-sync">
       <term><option>-N</option></term>
       <term><option>--no-sync</option></term>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7a58c33ace..acaeb00086 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -135,6 +135,7 @@ static char *lc_time = NULL;
 static char *lc_messages = NULL;
 static char locale_provider = COLLPROVIDER_LIBC;
 static char *icu_locale = NULL;
+static bool import_system_collations = true;
 static const char *default_text_search_config = NULL;
 static char *username = NULL;
 static bool pwprompt = false;
@@ -1491,7 +1492,8 @@ setup_collation(FILE *cmdfd)
 				  BOOTSTRAP_SUPERUSERID, COLLPROVIDER_LIBC, PG_UTF8);
 
 	/* Now import all collations we can find in the operating system */
-	PG_CMD_PUTS("SELECT pg_import_system_collations('pg_catalog');\n\n");
+	if (import_system_collations)
+		PG_CMD_PUTS("SELECT pg_import_system_collations('pg_catalog');\n\n");
 }
 
 /*
@@ -2116,6 +2118,8 @@ usage(const char *progname)
 	printf(_("      --no-locale           equivalent to --locale=C\n"));
 	printf(_("      --locale-provider={libc|icu}\n"
 			 "                            set default locale provider for new databases\n"));
+	printf(_("      --no-import-collations\n"
+			 "                            do not import collations from the system\n"));
 	printf(_("      --pwfile=FILE         read password for the new superuser from file\n"));
 	printf(_("  -T, --text-search-config=CFG\n"
 			 "                            default text search configuration\n"));
@@ -2767,6 +2771,7 @@ main(int argc, char *argv[])
 		{"discard-caches", no_argument, NULL, 14},
 		{"locale-provider", required_argument, NULL, 15},
 		{"icu-locale", required_argument, NULL, 16},
+		{"no-import-collations", no_argument, NULL, 17},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -2924,6 +2929,9 @@ main(int argc, char *argv[])
 			case 16:
 				icu_locale = pg_strdup(optarg);
 				break;
+			case 17:
+				import_system_collations = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
-- 
2.34.1

v7-0004-icu_multilib-contrib-extension-for-managing-ICU-l.patch (text/x-patch; charset=UTF-8)

From cdcce0e1f8dcb28fae8b7ef8413e70bd71e4a44d Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Sat, 14 Jan 2023 12:17:05 -0800
Subject: [PATCH v7 4/4] icu_multilib: contrib extension for managing ICU
 libraries.

The icu_multilib module provides control over the version (or
versions) of the ICU provider library used by PostgreSQL, which can be
different from the version of ICU with which it was built.

Control over the specific ICU library is important, because any
differences in collation may result in corrupt indexes, and possibly
other problems.

The primary use case is the ability to stabilize the ICU version
("lock" it to a specific version), even if the built-in ICU library
changes. It can be "locked" to a specific major and minor version (the
safest option); or to a major version only, enabling non-disruptive
minor version updates of ICU (albeit with some risk of subtle
differences in collation behavior).

If enabled, icu_multilib also provides support for multiple ICU
libraries by using "search by collator version" logic. This logic
assumes that identical collator versions (as provided by ICU) mean
identical behavior, which also carries some risk.

Author: Jeff Davis, Thomas Munro
---
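Reader's note (not part of the patch): the "search by collator version"
order described in the commit message can be sketched on its own. This is a
simplified model, not the module's code: mock_icu_library stands in for
pg_icu_library, each library reports a single collator version, and the
version strings in any caller are made up. As in the patch, the last table
slot is assumed to hold the built-in library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for pg_icu_library. */
typedef struct mock_icu_library
{
	int			major;				/* ICU major version */
	const char *collator_version;	/* version for the locale, or NULL */
} mock_icu_library;

/*
 * Scan from the last slot down: the built-in library is visited first,
 * then the loaded libraries from newest to oldest major version.  Return
 * the first library whose collator version matches exactly, or NULL.
 */
static const mock_icu_library *
search_by_collator_version(const mock_icu_library *table[], int table_size,
						   const char *requested_version)
{
	assert(table != NULL && requested_version != NULL);

	for (int i = table_size - 1; i >= 0; i--)
	{
		const mock_icu_library *lib = table[i];

		/* empty slot, or this library has no collator for the locale */
		if (lib == NULL || lib->collator_version == NULL)
			continue;

		if (strcmp(lib->collator_version, requested_version) == 0)
			return lib;
	}

	return NULL;
}
```

A NULL result corresponds to the case where the module falls through to the
configured default version and then to the built-in library. Note that two
libraries reporting the same collator version are treated as
interchangeable, which is the risk the commit message mentions.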
 contrib/Makefile                           |   6 +
 contrib/icu_multilib/Makefile              |  21 +
 contrib/icu_multilib/icu_multilib--1.0.sql |  43 +
 contrib/icu_multilib/icu_multilib.c        | 880 +++++++++++++++++++++
 contrib/icu_multilib/icu_multilib.control  |   7 +
 contrib/icu_multilib/meson.build           |  24 +
 contrib/meson.build                        |   3 +
 doc/src/sgml/contrib.sgml                  |   1 +
 doc/src/sgml/filelist.sgml                 |   1 +
 doc/src/sgml/icu-multilib.sgml             | 496 ++++++++++++
 10 files changed, 1482 insertions(+)
 create mode 100644 contrib/icu_multilib/Makefile
 create mode 100644 contrib/icu_multilib/icu_multilib--1.0.sql
 create mode 100644 contrib/icu_multilib/icu_multilib.c
 create mode 100644 contrib/icu_multilib/icu_multilib.control
 create mode 100644 contrib/icu_multilib/meson.build
 create mode 100644 doc/src/sgml/icu-multilib.sgml
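
Reader's note (not part of the patch): the default version can be given as
"major.minor" or as a major version only. A standalone sketch of that
parsing rule follows, modeled on the sscanf()/%n approach used in the patch.
The names version_pair and parse_version_sketch are hypothetical, and a
minor value of -1 means "major only".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct version_pair
{
	int			major;
	int			minor;			/* -1 if only a major version was given */
} version_pair;

static bool
parse_version_sketch(const char *s, version_pair *out)
{
	int			major;
	int			minor;
	int			len = 0;

	assert(s != NULL && out != NULL);

	/* "major.minor": both numbers must parse and consume the whole string */
	if (sscanf(s, "%d.%d%n", &major, &minor, &len) == 2 &&
		(size_t) len == strlen(s))
	{
		out->major = major;
		out->minor = minor;
		return true;
	}

	/* "major": a single number that consumes the whole string */
	len = 0;
	if (sscanf(s, "%d%n", &major, &len) == 1 &&
		(size_t) len == strlen(s))
	{
		out->major = major;
		out->minor = -1;
		return true;
	}

	return false;
}
```

The %n length check is what rejects trailing garbage such as "70.x", which
a bare sscanf() return value would accept.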

diff --git a/contrib/Makefile b/contrib/Makefile
index bbf220407b..06447d9346 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -65,6 +65,12 @@ else
 ALWAYS_SUBDIRS += uuid-ossp
 endif
 
+ifeq ($(with_icu),yes)
+SUBDIRS += icu_multilib
+else
+ALWAYS_SUBDIRS += icu_multilib
+endif
+
 ifeq ($(with_libxml),yes)
 SUBDIRS += xml2
 else
diff --git a/contrib/icu_multilib/Makefile b/contrib/icu_multilib/Makefile
new file mode 100644
index 0000000000..4823007abe
--- /dev/null
+++ b/contrib/icu_multilib/Makefile
@@ -0,0 +1,21 @@
+# contrib/icu_multilib/Makefile
+
+MODULE_big = icu_multilib
+OBJS = \
+	$(WIN32RES) \
+	icu_multilib.o
+
+EXTENSION = icu_multilib
+DATA = icu_multilib--1.0.sql
+PGFILEDESC = "icu_multilib -- support multiple ICU library versions"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/icu_multilib
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/icu_multilib/icu_multilib--1.0.sql b/contrib/icu_multilib/icu_multilib--1.0.sql
new file mode 100644
index 0000000000..6751242bb4
--- /dev/null
+++ b/contrib/icu_multilib/icu_multilib--1.0.sql
@@ -0,0 +1,43 @@
+/* contrib/icu_multilib/icu_multilib--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION icu_multilib" to load this file. \quit
+
+CREATE FUNCTION library_versions(
+    icu_version OUT TEXT,
+    unicode_version OUT TEXT,
+    cldr_version OUT TEXT,
+    libicui18n_name OUT TEXT,
+    libicuuc_name OUT TEXT
+  )
+  RETURNS SETOF RECORD LANGUAGE C
+  AS 'MODULE_PATHNAME', 'library_versions';
+
+CREATE FUNCTION library_collators(
+    major_version IN TEXT DEFAULT NULL,
+    locale OUT TEXT,
+    icu_version OUT TEXT,
+    uca_version OUT TEXT,
+    collator_version OUT TEXT
+  )
+  RETURNS SETOF RECORD LANGUAGE C
+  AS 'MODULE_PATHNAME', 'library_collators';
+
+CREATE FUNCTION collator_version_search (
+    locale IN TEXT,
+    requested_version IN TEXT DEFAULT NULL,
+    log_ok IN BOOLEAN DEFAULT FALSE,
+    icu_version OUT TEXT,
+    uca_version OUT TEXT,
+    collator_version OUT TEXT)
+  RETURNS RECORD LANGUAGE C
+  AS 'MODULE_PATHNAME', 'collator_version_search';
+
+CREATE FUNCTION collator_versions (
+    locale IN TEXT,
+    icu_version OUT TEXT,
+    uca_version OUT TEXT,
+    collator_version OUT TEXT
+  )
+  RETURNS SETOF RECORD LANGUAGE C
+  AS 'MODULE_PATHNAME', 'collator_versions';
diff --git a/contrib/icu_multilib/icu_multilib.c b/contrib/icu_multilib/icu_multilib.c
new file mode 100644
index 0000000000..2fde2a85d5
--- /dev/null
+++ b/contrib/icu_multilib/icu_multilib.c
@@ -0,0 +1,880 @@
+
+#include "postgres.h"
+
+#include <dlfcn.h>
+#include <limits.h>
+
+#include "common/string.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "nodes/execnodes.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_locale_internal.h"
+
+#ifndef USE_ICU
+#error "ICU support is required to build icu_multilib"
+#endif
+
+/*
+ * We don't want to call into dlopen'd ICU libraries that are newer than the
+ * one we were compiled and linked against, just in case there is an
+ * incompatible API change.
+ */
+#define PG_MAX_ICU_MAJOR U_ICU_VERSION_MAJOR_NUM
+
+/*
+ * The oldest ICU release we're likely to encounter, and that has all the
+ * functions required.
+ */
+#define PG_MIN_ICU_MAJOR 50
+
+/*
+ * Enough to hold entries for minimum to maximum supported ICU versions, and
+ * also the builtin ICU (if icu_multilib.include_builtin is set).
+ */
+#define ICU_LIB_TABLESIZE (PG_MAX_ICU_MAJOR - PG_MIN_ICU_MAJOR + 2)
+
+typedef struct icu_version
+{
+	int major;
+	int minor;
+} icu_version;
+
+static const struct config_enum_entry log_level_options[] = {
+	{"debug5", DEBUG5, false},
+	{"debug4", DEBUG4, false},
+	{"debug3", DEBUG3, false},
+	{"debug2", DEBUG2, false},
+	{"debug1", DEBUG1, false},
+	{"debug", DEBUG2, true},
+	{"log", LOG, false},
+	{"info", INFO, true},
+	{"notice", NOTICE, false},
+	{"warning", WARNING, false},
+	{"error", ERROR, false},
+	{NULL, 0, false}
+};
+
+PG_MODULE_MAGIC;
+
+static get_icu_library_hook_type prev_icu_library_hook = NULL;
+
+static pg_icu_library *icu_library_table[ICU_LIB_TABLESIZE] = {NULL};
+
+static char *icu_library_path = "";
+static char *default_icu_version = "";
+static icu_version default_version = { .major = -1, .minor = -1 };
+static bool search_by_collator_version = false;
+static bool include_builtin = true;
+static int version_mismatch_log_level = WARNING;
+static int library_search_log_level = DEBUG1;
+
+static void initialize_gucs_1(void);
+static void initialize_gucs_2(void);
+static pg_icu_library *icu_multilib_hook(Oid collid, const char *locale,
+										 const char *version);
+static pg_icu_library *icu_multilib_search(Oid collid, const char *locale,
+										   const char *version,
+										   bool logOk);
+static void load_all_libraries(void);
+static pg_icu_library *load_icu_library(int major);
+static pg_icu_library *get_icu_by_major_version(int major);
+static void get_library_version(const pg_icu_library *lib, int *major,
+								int *minor);
+
+static bool check_string_version(char **newval, void **extra, GucSource source);
+static void assign_default_version(const char *newval, void *extra);
+
+void
+_PG_init()
+{
+	/*TODO: error messages */
+	if (!process_shared_preload_libraries_in_progress)
+		elog(ERROR, "icu_multilib must be loaded by shared_preload_libraries");
+
+	prev_icu_library_hook = get_icu_library_hook;
+	get_icu_library_hook = icu_multilib_hook;
+
+	initialize_gucs_1();
+	load_all_libraries();
+	initialize_gucs_2();
+}
+
+/* initialize GUCs before loading ICU libraries */
+static void
+initialize_gucs_1()
+{
+	/*
+	 * The library search path is initialized once per server start and not
+	 * changeable -- this keeps the memory usage in TopMemoryContext bounded.
+	 */
+	DefineCustomStringVariable("icu_multilib.library_path",
+							   "Filesystem path where ICU libraries are installed.",
+							   NULL,
+							   &icu_library_path,
+							   "",
+							   PGC_POSTMASTER,
+							   0, NULL, NULL, NULL);
+	DefineCustomEnumVariable("icu_multilib.version_mismatch_log_level",
+							 "Level of log message when a collator version mismatch is detected.",
+							 NULL,
+							 &version_mismatch_log_level,
+							 WARNING,
+							 log_level_options,
+							 PGC_SUSET,
+							 0, NULL, NULL, NULL);
+	DefineCustomEnumVariable("icu_multilib.library_search_log_level",
+							 "Level of log messages related to searching for an ICU library.",
+							 NULL,
+							 &library_search_log_level,
+							 DEBUG1,
+							 log_level_options,
+							 PGC_SUSET,
+							 0, NULL, NULL, NULL);
+}
+
+/* initialize GUCs after loading ICU libraries */
+static void
+initialize_gucs_2()
+{
+	DefineCustomStringVariable("icu_multilib.default_icu_version",
+							   "The version of the default ICU library.",
+							   "Can be specified with major and minor versions, or major version only.",
+							   &default_icu_version,
+							   "",
+							   PGC_SUSET,
+							   0, check_string_version, assign_default_version, NULL);
+	DefineCustomBoolVariable("icu_multilib.include_builtin",
+							 "Include built-in ICU library when listing or searching libraries.",
+							 NULL,
+							 &include_builtin,
+							 true,
+							 PGC_SUSET,
+							 0, NULL, NULL, NULL);
+	DefineCustomBoolVariable("icu_multilib.search_by_collator_version",
+							 "Enable searching for the ICU library based on the collator version.",
+							 NULL,
+							 &search_by_collator_version,
+							 false,
+							 PGC_SUSET,
+							 0, NULL, NULL, NULL);
+}
+
+static bool
+parse_icu_version(const char *version, icu_version *out_version)
+{
+	size_t	input_len = strlen(version);
+	int		parse_len;
+	int		parse_major;
+	int		parse_minor;
+
+	if (sscanf(version, "%d.%d%n",
+			   &parse_major, &parse_minor, &parse_len) == 2 &&
+		input_len == parse_len)
+	{
+		out_version->major = parse_major;
+		out_version->minor = parse_minor;
+		return true;
+	}
+	else if (sscanf(version, "%d%n", &parse_major, &parse_len) == 1 &&
+			 input_len == parse_len)
+	{
+		out_version->major = parse_major;
+		out_version->minor = -1;
+		return true;
+	}
+
+	return false;
+}
+
+static bool
+check_string_version(char **newval, void **extra, GucSource source)
+{
+	pg_icu_library	*lib;
+	icu_version		*myextra;
+	icu_version		 parsed_version;
+
+	if (*newval == NULL || (*newval)[0] == '\0')
+		return true;
+
+	if (!parse_icu_version(*newval, &parsed_version))
+	{
+		GUC_check_errmsg("error parsing ICU version \"%s\"", *newval);
+		GUC_check_errhint("The ICU version should be either in the form \"major.minor\", "
+						  "or just be a plain major version number.");
+		return false;
+	}
+
+	if (parsed_version.major < PG_MIN_ICU_MAJOR ||
+		parsed_version.major > PG_MAX_ICU_MAJOR)
+	{
+		GUC_check_errmsg("major version %d is out of range",
+						 parsed_version.major);
+		GUC_check_errhint("ICU major version must be between %d and %d.",
+						  PG_MIN_ICU_MAJOR, PG_MAX_ICU_MAJOR);
+		return false;
+	}
+
+	lib = get_icu_by_major_version(parsed_version.major);
+	if (!lib)
+	{
+		GUC_check_errmsg("ICU library with major version %d not found",
+						 parsed_version.major);
+		return false;
+	}
+
+	if (parsed_version.minor != -1 && lib->minor_version != parsed_version.minor)
+	{
+		GUC_check_errmsg("ICU library with major version %d has minor version %d; "
+						 "expected minor version %d",
+						 lib->major_version, lib->minor_version,
+						 parsed_version.minor);
+		return false;
+	}
+
+	myextra = guc_malloc(ERROR, sizeof(icu_version));
+	*myextra = parsed_version;
+	*extra = myextra;
+
+	return true;
+}
+
+static void
+assign_default_version(const char *newval, void *extra)
+{
+	icu_version *myextra = (icu_version *) extra;
+
+	if (myextra == NULL)
+		return;
+
+	default_version = *myextra;
+}
+
+static void
+load_all_libraries()
+{
+	pg_icu_library *builtin = get_builtin_icu_library();
+
+	icu_library_table[ICU_LIB_TABLESIZE - 1] = builtin;
+
+	ereport(LOG, (errmsg("icu_multilib: retrieved built-in ICU version %d.%d",
+						 builtin->major_version, builtin->minor_version)));
+
+	for (int major = PG_MAX_ICU_MAJOR; major >= PG_MIN_ICU_MAJOR; major--)
+		icu_library_table[major - PG_MIN_ICU_MAJOR] = load_icu_library(major);
+}
+
+static pg_icu_library *
+get_icu_by_major_version(int major)
+{
+	pg_icu_library *lib;
+	if (major < PG_MIN_ICU_MAJOR || major > PG_MAX_ICU_MAJOR)
+		ereport(ERROR, (errmsg("major version %d is out of range", major)));
+
+	lib = icu_library_table[major - PG_MIN_ICU_MAJOR];
+	return lib;
+}
+
+static void
+get_library_version(const pg_icu_library *lib, int *major, int *minor)
+{
+	UVersionInfo version_info;
+	lib->getICUVersion(version_info);
+	*major = version_info[0];
+	*minor = version_info[1];
+	return;
+}
+
+/*
+ * Fill in out_version (which must have U_MAX_VERSION_STRING_LENGTH bytes
+ * available) with the collator version of the given locale in the given
+ * library. Return false if the collator is not found.
+ */
+static bool
+lib_collator_version(pg_icu_library *lib, const char *locale,
+					 char *out_version)
+{
+	UCollator		*collator;
+	UVersionInfo	 version_info;
+	UErrorCode		 status;
+
+	status = U_ZERO_ERROR;
+	collator = lib->openCollator(locale, &status);
+
+	if (!collator)
+		return false;
+
+	lib->getCollatorVersion(collator, version_info);
+	lib->versionToString(version_info, out_version);
+	lib->closeCollator(collator);
+
+	return true;
+}
+
+/*
+ * Find the right ICU library for the given locale and version. The resulting
+ * library may or may not provide a collator with an exactly-matching
+ * version.
+ *
+ * If search_by_collator_version is set, scan the table (first the built-in
+ * ICU library, then descending order of major versions) to find the first
+ * library that provides a collator of the given locale with a matching
+ * version.
+ *
+ * If no exactly matching version is found, and
+ * icu_multilib.default_icu_version is set, return the default library.
+ *
+ * Otherwise fall back to the built-in library.
+ */
+static pg_icu_library *
+icu_multilib_hook(Oid collid, const char *locale,
+				  const char *requested_version)
+{
+	return icu_multilib_search(collid, locale, requested_version, true);
+}
+
+/*
+ * Workhorse for icu_multilib_hook(); non-error messages only if logOk.
+ */
+static pg_icu_library *
+icu_multilib_search(Oid collid, const char *locale,
+					const char *requested_version, bool logOk)
+{
+	char			 actual_version[U_MAX_VERSION_STRING_LENGTH];
+	pg_icu_library	*found_lib = NULL;
+
+	/*
+	 * If another hook was set first, defer to that unless it returns NULL or
+	 * a library that doesn't contain the given collator at all. This may
+	 * result in a mismatching collator version, but we don't want to
+	 * speculate about what's better or worse in the presence of other hooks.
+	 */
+	if (prev_icu_library_hook)
+	{
+		pg_icu_library *tmp_lib;
+		tmp_lib = prev_icu_library_hook(collid, locale, requested_version);
+		if (tmp_lib && lib_collator_version(tmp_lib, locale, actual_version))
+			found_lib = tmp_lib;
+	}
+
+	if (!found_lib && search_by_collator_version && requested_version != NULL)
+	{
+		/*
+		 * Search from newest library to oldest for a matching version of the
+		 * collator with the given name.
+		 */
+		for (int i = ICU_LIB_TABLESIZE - 1; i >= 0; i--)
+		{
+			char			 tmp_version[U_MAX_VERSION_STRING_LENGTH];
+			pg_icu_library	*tmp_lib = icu_library_table[i];
+
+			if (tmp_lib == NULL)
+				continue;
+
+			if (!include_builtin && i == ICU_LIB_TABLESIZE - 1)
+				continue;
+
+			if (lib_collator_version(tmp_lib, locale, tmp_version) &&
+				strcmp(requested_version, tmp_version) == 0)
+			{
+				strcpy(actual_version, tmp_version);
+				found_lib = tmp_lib;
+				break;
+			}
+		}
+	}
+
+	if (!found_lib && default_version.major != -1)
+	{
+		pg_icu_library *tmp_lib;
+		tmp_lib = get_icu_by_major_version(default_version.major);
+		if (!tmp_lib)
+			ereport(ERROR,
+					(errmsg("icu_multilib: default major version %d not found",
+							default_version.major)));
+		if (lib_collator_version(tmp_lib, locale, actual_version))
+			found_lib = tmp_lib;
+		else if (logOk)
+			ereport(library_search_log_level,
+					(errmsg("icu_multilib: found default ICU %d.%d, but collator \"%s\" not found",
+							tmp_lib->major_version, tmp_lib->minor_version, locale)));
+	}
+
+	if (!found_lib && include_builtin)
+	{
+		pg_icu_library *tmp_lib;
+		tmp_lib = icu_library_table[ICU_LIB_TABLESIZE - 1];
+		if (lib_collator_version(tmp_lib, locale, actual_version))
+			found_lib = tmp_lib;
+		else if (logOk)
+			ereport(library_search_log_level,
+					(errmsg("icu_multilib: found built-in ICU %d.%d, but collator \"%s\" not found",
+							tmp_lib->major_version, tmp_lib->minor_version, locale)));
+	}
+
+	/* if not found, fall back to built-in or other hook */
+	if (!found_lib)
+		return NULL;
+
+	if (logOk)
+		ereport(library_search_log_level,
+				(errmsg("icu_multilib: found ICU version %d.%d providing collator version \"%s\" for locale \"%s\"",
+						found_lib->major_version, found_lib->minor_version,
+						actual_version, locale)));
+
+	/*
+	 * This is somewhat redundant with a similar warning in pg_locale.c, but
+	 * it provides details about the locale name and ICU version, which is
+	 * helpful when multiple ICU libraries are in use.
+	 */
+	if (requested_version && logOk &&
+		strcmp(requested_version, actual_version) != 0)
+	{
+		ereport(version_mismatch_log_level,
+				(errmsg("icu_multilib: collator version mismatch detected for locale \"%s\"",
+						locale),
+				 errdetail("ICU %d.%d provides collator version \"%s\" for locale \"%s\"; expected version \"%s\".",
+						   found_lib->major_version, found_lib->minor_version,
+						   actual_version, locale, requested_version)));
+	}
+
+	return found_lib;
+}
+
+PG_FUNCTION_INFO_V1(library_versions);
+Datum
+library_versions(PG_FUNCTION_ARGS)
+{
+#define LIBRARY_VERSIONS_COLS 5
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum           values[LIBRARY_VERSIONS_COLS];
+	bool            nulls[LIBRARY_VERSIONS_COLS];
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	for (int i = ICU_LIB_TABLESIZE - 1; i >= 0; i--)
+	{
+		UErrorCode      status;
+		UVersionInfo version_info;
+		char            version_string[U_MAX_VERSION_STRING_LENGTH];
+		pg_icu_library	*lib = icu_library_table[i];
+
+		if (lib == NULL)
+			continue;
+
+		if (!include_builtin && i == ICU_LIB_TABLESIZE - 1)
+			continue;
+
+		lib->getICUVersion(version_info);
+		lib->versionToString(version_info, version_string);
+		values[0] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[0] = false;
+
+		lib->getUnicodeVersion(version_info);
+		lib->versionToString(version_info, version_string);
+		values[1] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[1] = false;
+		status = U_ZERO_ERROR;
+		lib->getCLDRVersion(version_info, &status);
+		if (U_SUCCESS(status))
+		{
+			lib->versionToString(version_info, version_string);
+			values[2] = PointerGetDatum(cstring_to_text(version_string));
+			nulls[2] = false;
+		}
+		else
+		{
+			nulls[2] = true;
+		}
+
+		values[3] = PointerGetDatum(cstring_to_text(lib->libicui18n_name));
+		nulls[3] = false;
+
+		values[4] = PointerGetDatum(cstring_to_text(lib->libicuuc_name));
+		nulls[4] = false;
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
+
+	return (Datum) 0;
+}
+
+#define COLLATOR_DETAIL_COLS 3
+static void
+make_collator_detail_record(pg_icu_library *lib, const char *locale,
+							Datum *values, bool *nulls)
+{
+	UErrorCode		 status;
+	UCollator		*collator;
+	UVersionInfo	 version_info;
+	char			 version_string[U_MAX_VERSION_STRING_LENGTH];
+
+	status = U_ZERO_ERROR;
+	collator = lib->openCollator(locale, &status);
+
+	lib->getICUVersion(version_info);
+	lib->versionToString(version_info, version_string);
+	values[0] = PointerGetDatum(cstring_to_text(version_string));
+	nulls[0] = false;
+
+	if (collator)
+	{
+		lib->getUCAVersion(collator, version_info);
+		lib->versionToString(version_info, version_string);
+		values[1] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[1] = false;
+	}
+	else
+		nulls[1] = true;
+
+	if (collator)
+	{
+		lib->getCollatorVersion(collator, version_info);
+		lib->versionToString(version_info, version_string);
+		values[2] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[2] = false;
+	}
+	else
+		nulls[2] = true;
+
+	if (collator)
+		lib->closeCollator(collator);
+
+	return;
+}
+
+PG_FUNCTION_INFO_V1(library_collators);
+Datum
+library_collators(PG_FUNCTION_ARGS)
+{
+#define LIBRARY_COLLATORS_COLS (COLLATOR_DETAIL_COLS + 1)
+	pg_icu_library	*lib  = NULL;
+	ReturnSetInfo	*rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum			 values[LIBRARY_COLLATORS_COLS];
+	bool			 nulls[LIBRARY_COLLATORS_COLS];
+
+	if (PG_ARGISNULL(0))
+	{
+		lib = icu_library_table[ICU_LIB_TABLESIZE - 1];
+	}
+	else
+	{
+		char *major_str = text_to_cstring(PG_GETARG_TEXT_PP(0));
+		int major = atoi(major_str); /* TODO? */
+
+		if (major < PG_MIN_ICU_MAJOR || major > PG_MAX_ICU_MAJOR)
+			ereport(ERROR, (errmsg("icu_multilib: major version %d is out of range", major)));
+		lib = get_icu_by_major_version(major);
+		if (!lib)
+			ereport(ERROR, (errmsg("icu_multilib: major version %d not found", major)));
+	}
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	for (int i = -1; i < lib->countAvailable(); i++)
+	{
+		const char	*name;
+		const char	*locale;
+		UErrorCode	 status;
+		char		 langtag[ULOC_FULLNAME_CAPACITY];
+
+		if (i == -1)
+			name = "";		/* ICU root locale */
+		else
+			name = lib->getAvailable(i);
+
+		status = U_ZERO_ERROR;
+		lib->toLanguageTag(name, langtag, sizeof(langtag), true, &status);
+
+		if (U_FAILURE(status))
+			continue;
+
+		locale = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
+
+		if (!pg_is_ascii(langtag) || !pg_is_ascii(locale))
+			continue;
+
+		values[0] = PointerGetDatum(cstring_to_text(locale));
+		nulls[0] = false;
+
+		make_collator_detail_record(lib, locale, values + 1, nulls + 1);
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
+
+	return (Datum) 0;
+}
+
+PG_FUNCTION_INFO_V1(collator_version_search);
+Datum
+collator_version_search(PG_FUNCTION_ARGS)
+{
+#define COLLATOR_VERSION_SEARCH_COLS COLLATOR_DETAIL_COLS
+	const char		*locale;
+	const char		*requested_version = NULL;
+	bool			 logOk;
+	pg_icu_library	*lib;
+	int				 major, minor;
+	TupleDesc		 tupdesc;
+	HeapTuple		 tuple;
+	Datum			 values[COLLATOR_VERSION_SEARCH_COLS];
+	bool			 nulls[COLLATOR_VERSION_SEARCH_COLS];
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	if (PG_ARGISNULL(0))
+		ereport(ERROR, (errmsg("locale argument must be non-NULL")));
+	if (PG_ARGISNULL(2))
+		ereport(ERROR, (errmsg("log_ok argument must be non-NULL")));
+
+	locale = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	if (!PG_ARGISNULL(1))
+		requested_version = text_to_cstring(PG_GETARG_TEXT_PP(1));
+	logOk = PG_GETARG_BOOL(2);
+
+	lib = icu_multilib_search(InvalidOid, locale, requested_version, logOk);
+	if (lib == NULL)
+		PG_RETURN_NULL();		/* no library provides this collator */
+	get_library_version(lib, &major, &minor);
+
+	make_collator_detail_record(lib, locale, values, nulls);
+
+	tuple = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
+}
+
+PG_FUNCTION_INFO_V1(collator_versions);
+Datum
+collator_versions(PG_FUNCTION_ARGS)
+{
+#define COLLATOR_VERSIONS_COLS COLLATOR_DETAIL_COLS
+	const char *locale;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum           values[COLLATOR_VERSIONS_COLS];
+	bool            nulls[COLLATOR_VERSIONS_COLS];
+
+	if (PG_ARGISNULL(0))
+		ereport(ERROR, (errmsg("locale argument must be non-NULL")));
+	locale = text_to_cstring(PG_GETARG_TEXT_PP(0));
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	for (int i = ICU_LIB_TABLESIZE - 1; i >= 0; i--)
+	{
+		pg_icu_library	*lib = icu_library_table[i];
+
+		if (lib == NULL)
+			continue;
+
+		if (!include_builtin && i == ICU_LIB_TABLESIZE - 1)
+			continue;
+
+		make_collator_detail_record(lib, locale, values, nulls);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
+
+	return (Datum) 0;
+}
+
+static void
+make_icu_library_names(int major, char **libicui18n, char **libicuuc)
+{
+	char libicui18n_tmp[MAXPGPATH];
+	char libicuuc_tmp[MAXPGPATH];
+
+	/*
+	 * See
+	 * https://unicode-org.github.io/icu/userguide/icu4c/packaging.html#icu-versions
+	 * for conventions on library naming on POSIX and Windows systems.  Apple
+	 * isn't mentioned but varies in the usual way.
+	 *
+	 * Format 1 is expected to be a major version-only symlink pointing to a
+	 * specific minor version (or on Windows it may be the actual library).
+	 * Format 2 is expected to be an actual library.
+	 */
+#ifdef WIN32
+#define ICU_LIBRARY_NAME_FORMAT1 "%s%sicu%s%d" DLSUFFIX
+#elif defined(__darwin__)
+#define ICU_LIBRARY_NAME_FORMAT1 "%s%slibicu%s.%d" DLSUFFIX
+#else
+#define ICU_LIBRARY_NAME_FORMAT1 "%s%slibicu%s" DLSUFFIX ".%d"
+#endif
+
+#ifdef WIN32
+#define PATH_SEPARATOR "\\"
+#define ICU_I18N "in"
+#define ICU_UC "uc"
+#else
+#define PATH_SEPARATOR "/"
+#define ICU_I18N "i18n"
+#define ICU_UC "uc"
+#endif
+
+	snprintf(libicui18n_tmp,
+			 MAXPGPATH,
+			 ICU_LIBRARY_NAME_FORMAT1,
+			 icu_library_path, icu_library_path[0] ? PATH_SEPARATOR : "",
+			 ICU_I18N, major);
+	snprintf(libicuuc_tmp,
+			 MAXPGPATH,
+			 ICU_LIBRARY_NAME_FORMAT1,
+			 icu_library_path, icu_library_path[0] ? PATH_SEPARATOR : "",
+			 ICU_UC, major);
+
+	*libicui18n = pstrdup(libicui18n_tmp);
+	*libicuuc = pstrdup(libicuuc_tmp);
+}
+
+#define MAX_SYMBOL_LEN 128
+static void *
+load_icu_function(void *handle, const char *function, int major)
+{
+	char     function_with_version[MAX_SYMBOL_LEN];
+	int		 ret;
+	void    *result;
+
+	/*
+	 * Try to look up the symbol with the library major version as a suffix.
+	 */
+	ret = snprintf(function_with_version, sizeof(function_with_version),
+				   "%s_%d", function, major);
+	if (ret >= sizeof(function_with_version))
+		elog(ERROR, "icu_multilib: cannot construct symbol name");
+	result = dlsym(handle, function_with_version);
+
+	/*
+	 * Library may have been configured with --disable-renaming, try without
+	 * major version suffix.
+	 */
+	if (result == NULL)
+		result = dlsym(handle, function);
+
+	return result;
+}
+
+#define LOAD_FUNC(DEST, LIBHANDLE, LIBNAME, FNAME)	\
+	do {										\
+		DEST = load_icu_function((LIBHANDLE), (FNAME), major);	\
+		if (DEST == NULL)						\
+		{										\
+			ereport(WARNING,					\
+					(errmsg("icu_multilib: could not find symbol \"%s\" in library \"%s\"",	\
+							(FNAME), (LIBNAME))));	\
+			goto err;							\
+		}										\
+	} while(0)
+
+#define LOAD_FUNC_I18N(DEST, FNAME) \
+	LOAD_FUNC(DEST, libicui18n_handle, libicui18n, FNAME)
+#define LOAD_FUNC_UC(DEST, FNAME) \
+	LOAD_FUNC(DEST, libicuuc_handle, libicuuc, FNAME)
+
+static pg_icu_library *
+load_icu_library(int major)
+{
+	UVersionInfo	 version_info;
+	pg_icu_library	*lib;
+	char			*libicui18n;
+	char			*libicuuc;
+	void			*libicui18n_handle = NULL;
+	void			*libicuuc_handle   = NULL;
+
+	make_icu_library_names(major, &libicui18n, &libicuuc);
+
+	libicui18n_handle = dlopen(libicui18n, RTLD_NOW | RTLD_LOCAL);
+	if (!libicui18n_handle)
+		return NULL;
+
+	/* Load the common library. */
+	libicuuc_handle = dlopen(libicuuc, RTLD_NOW | RTLD_LOCAL);
+	if (!libicuuc_handle)
+	{
+		elog(WARNING, "found ICU library \"%s\" but not companion library \"%s\"",
+			 libicui18n, libicuuc);
+		dlclose(libicui18n_handle);
+		return NULL;
+	}
+
+	/*
+	 * We only allocate the pg_icu_library object after successfully
+	 * opening the libraries to minimize the work done in the ENOENT case,
+	 * when probing a range of versions.  That means we might need to
+	 * clean up on allocation failure.
+	 */
+	lib = MemoryContextAllocExtended(TopMemoryContext, sizeof(*lib),
+									 MCXT_ALLOC_NO_OOM | MCXT_ALLOC_ZERO);
+	if (lib)
+	{
+		/* only touch the struct once we know its allocation succeeded */
+		lib->libicui18n_name = MemoryContextAllocExtended(
+			TopMemoryContext, strlen(libicui18n) + 1, MCXT_ALLOC_NO_OOM);
+		lib->libicuuc_name = MemoryContextAllocExtended(
+			TopMemoryContext, strlen(libicuuc) + 1, MCXT_ALLOC_NO_OOM);
+	}
+
+	if (!lib || !lib->libicui18n_name || !lib->libicuuc_name)
+	{
+		dlclose(libicui18n_handle);
+		dlclose(libicuuc_handle);
+		elog(ERROR, "out of memory");
+	}
+
+	strcpy(lib->libicui18n_name, libicui18n);
+	strcpy(lib->libicuuc_name, libicuuc);
+
+	pfree(libicui18n);
+	pfree(libicuuc);
+
+	/* try to find all the symbols we need from the i18n library */
+	LOAD_FUNC_I18N(lib->getICUVersion, "u_getVersion");
+	LOAD_FUNC_I18N(lib->getUnicodeVersion, "u_getUnicodeVersion");
+	LOAD_FUNC_I18N(lib->getCLDRVersion, "ulocdata_getCLDRVersion");
+	LOAD_FUNC_I18N(lib->openCollator, "ucol_open");
+	LOAD_FUNC_I18N(lib->closeCollator, "ucol_close");
+	LOAD_FUNC_I18N(lib->getCollatorVersion, "ucol_getVersion");
+	LOAD_FUNC_I18N(lib->getUCAVersion, "ucol_getUCAVersion");
+	LOAD_FUNC_I18N(lib->versionToString, "u_versionToString");
+	LOAD_FUNC_I18N(lib->strcoll, "ucol_strcoll");
+	LOAD_FUNC_I18N(lib->strcollUTF8, "ucol_strcollUTF8");
+	LOAD_FUNC_I18N(lib->getSortKey, "ucol_getSortKey");
+	LOAD_FUNC_I18N(lib->nextSortKeyPart, "ucol_nextSortKeyPart");
+	LOAD_FUNC_I18N(lib->setUTF8, "uiter_setUTF8");
+	LOAD_FUNC_I18N(lib->errorName, "u_errorName");
+	LOAD_FUNC_I18N(lib->setAttribute, "ucol_setAttribute");
+
+	/* try to find all the symbols we need from the uc library */
+	LOAD_FUNC_UC(lib->strToUpper, "u_strToUpper");
+	LOAD_FUNC_UC(lib->strToLower, "u_strToLower");
+	LOAD_FUNC_UC(lib->strToTitle, "u_strToTitle");
+	LOAD_FUNC_UC(lib->openConverter, "ucnv_open");
+	LOAD_FUNC_UC(lib->closeConverter, "ucnv_close");
+	LOAD_FUNC_UC(lib->fromUChars, "ucnv_fromUChars");
+	LOAD_FUNC_UC(lib->toUChars, "ucnv_toUChars");
+	LOAD_FUNC_UC(lib->toLanguageTag, "uloc_toLanguageTag");
+	LOAD_FUNC_UC(lib->getDisplayName, "uloc_getDisplayName");
+	LOAD_FUNC_UC(lib->countAvailable, "uloc_countAvailable");
+	LOAD_FUNC_UC(lib->getAvailable, "uloc_getAvailable");
+
+	lib->getICUVersion(version_info);
+	lib->major_version = version_info[0];
+	lib->minor_version = version_info[1];
+
+	if (lib->major_version != major)
+		ereport(version_mismatch_log_level,
+				(errmsg("loaded library for major version %d, but library reports major version %d",
+						major, lib->major_version)));
+
+	ereport(LOG, (errmsg("icu_multilib: loaded ICU version %d.%d",
+						 lib->major_version, lib->minor_version)));
+
+	return lib;
+
+err:
+	dlclose(libicui18n_handle);
+	dlclose(libicuuc_handle);
+	pfree(lib->libicui18n_name);
+	pfree(lib->libicuuc_name);
+	pfree(lib);
+	return NULL;
+}
diff --git a/contrib/icu_multilib/icu_multilib.control b/contrib/icu_multilib/icu_multilib.control
new file mode 100644
index 0000000000..8f2c546781
--- /dev/null
+++ b/contrib/icu_multilib/icu_multilib.control
@@ -0,0 +1,7 @@
+# icu_multilib extension
+comment = 'support multiple ICU library versions'
+default_version = '1.0'
+module_pathname = '$libdir/icu_multilib'
+relocatable = false
+schema = icu_multilib
+trusted = true
diff --git a/contrib/icu_multilib/meson.build b/contrib/icu_multilib/meson.build
new file mode 100644
index 0000000000..f2ea4088e4
--- /dev/null
+++ b/contrib/icu_multilib/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+icu_multilib_sources = files(
+  'icu_multilib.c',
+)
+
+if host_system == 'windows'
+  icu_multilib_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'icu_multilib',
+    '--FILEDESC', 'icu_multilib -- support multiple ICU library versions',])
+endif
+
+icu_multilib = shared_module('icu_multilib',
+  icu_multilib_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += icu_multilib
+
+install_data(
+  'icu_multilib.control',
+  'icu_multilib--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..03f72147cf 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -28,6 +28,9 @@ subdir('fuzzystrmatch')
 subdir('hstore')
 subdir('hstore_plperl')
 subdir('hstore_plpython')
+if icu.found()
+   subdir('icu_multilib')
+endif
 subdir('intagg')
 subdir('intarray')
 subdir('isn')
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 4e7b87a42f..469a1746b9 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -113,6 +113,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
  &file-fdw;
  &fuzzystrmatch;
  &hstore;
+ &icu-multilib;
  &intagg;
  &intarray;
  &isn;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 0d6be9a2fa..091ce0a60b 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -131,6 +131,7 @@
 <!ENTITY file-fdw        SYSTEM "file-fdw.sgml">
 <!ENTITY fuzzystrmatch   SYSTEM "fuzzystrmatch.sgml">
 <!ENTITY hstore          SYSTEM "hstore.sgml">
+<!ENTITY icu-multilib    SYSTEM "icu-multilib.sgml">
 <!ENTITY intagg          SYSTEM "intagg.sgml">
 <!ENTITY intarray        SYSTEM "intarray.sgml">
 <!ENTITY isn             SYSTEM "isn.sgml">
diff --git a/doc/src/sgml/icu-multilib.sgml b/doc/src/sgml/icu-multilib.sgml
new file mode 100644
index 0000000000..838872c880
--- /dev/null
+++ b/doc/src/sgml/icu-multilib.sgml
@@ -0,0 +1,496 @@
+<!-- doc/src/sgml/icu-multilib.sgml -->
+
+<sect1 id="icu-multilib" xreflabel="icu_multilib">
+ <title>icu_multilib</title>
+
+ <indexterm zone="icu-multilib">
+  <primary>icu_multilib</primary>
+ </indexterm>
+
+ <para>
+  The <filename>icu_multilib</filename> module provides control over the
+  version (or versions) of the ICU provider library used by
+  <productname>PostgreSQL</productname>, which can be different from the
+  version of ICU with which it was built.
+ </para>
+
+ <para>
+  Collations are a product of natural language, and natural language evolves
+  over time; but <productname>PostgreSQL</productname> depends on stable
+  ordering for structures such as indexes. Newer versions of ICU update the
+  provided collators to adapt to changes in natural language, so it's
+  important to control when and how those new versions of ICU are used to
+  prevent problems such as index corruption.
+ </para>
+
+ <para>
+  This module assumes that the necessary versions of ICU are already
+  available, such as through the operating system's package manager, and
+  already properly installed in a single location accessible to
+  <productname>PostgreSQL</productname>. The configuration variable
+  <literal>icu_multilib.library_path</literal> should be set to the location
+  where these ICU library versions are installed.
+ </para>
+
+ <para>
+  <filename>icu_multilib</filename> must be loaded via
+  <literal>shared_preload_libraries</literal>.
+  <filename>icu_multilib</filename> ignores any ICU library with a major
+  version greater than that with which <productname>PostgreSQL</productname>
+  was built.
+ </para>
+
+ <sect2 id="icu-multilib-version-stability">
+  <title>ICU Version Stability</title>
+
+  <para>
+   The simplest way to use <filename>icu_multilib</filename> is to force ICU
+   version stability. Version stability means that
+   <productname>PostgreSQL</productname> will use the same version of ICU for
+   all collations, even if the operating system or
+   <productname>PostgreSQL</productname> itself is upgraded or moved. The
+   stable version may or may not be the same as that with which
+   <productname>PostgreSQL</productname> was built.
+  </para>
+
+  <para>
+   Version stability may be based on both the major and minor version of ICU,
+   such as <literal>68.2</literal>; or on the major version alone, such as
+   <literal>68</literal>. Specifying both the major and minor versions more
+   closely controls exactly which ICU library is loaded, but applying minor
+   version upgrades of the ICU library may interfere with normal
+   operation. Specifying the major version alone makes applying ICU minor
+   version updates simpler, but carries a risk of subtle differences in
+   collation order if the library is updated.
+  </para>
+
+  <para>
+   To configure for version stability, first make sure that
+   <varname>icu_multilib.library_path</varname> is set correctly, which you
+   can observe by executing the
+   <literal>icu_multilib.library_versions()</literal> function. Then set
+   <varname>icu_multilib.default_icu_version</varname> to the stable version,
+   and set <varname>icu_multilib.search_by_collator_version</varname> to
+   <literal>false</literal>. These configuration variables should be set in
+   <filename>postgresql.conf</filename> to ensure that the stable version of
+   ICU is used consistently.
+  </para>
+
+  <note>
+   <para>
+    Ensure that the stable version of the ICU library is always available,
+    including on all replicas and through operating system and
+    <productname>PostgreSQL</productname> upgrades.
+   </para>
+   <para>
+    Similarly, keep the settings in <filename>postgresql.conf</filename>, and
+    keep them consistent across replicas and upgrades.
+   </para>
+  </note>
+
+  <para>
+   For best results, stabilize the ICU version immediately after
+   <command>initdb</command> is run, to ensure that initial collations are
+   loaded from the stabilized version of the ICU library. To do this, use the
+   option <xref linkend="app-initdb-option-no-import-collations"/> for
+   <command>initdb</command>, import the collations manually, and then refresh
+   the versions for the default collations. For instance:
+  </para>
+
+<programlisting>
+$ initdb --no-import-collations -D data --locale-provider=icu --icu-locale="en_US" --locale="en_US.UTF-8"
+$ cat - >> data/postgresql.conf
+shared_preload_libraries = 'icu_multilib'
+icu_multilib.library_path = '/path/to/icu/lib'
+icu_multilib.default_icu_version = '65.1'
+icu_multilib.search_by_collator_version = false
+icu_multilib.include_builtin = false
+^D
+$ pg_ctl -D data -l logfile start
+$ psql template1
+WARNING:  icu_multilib: collator version mismatch detected for locale "en_US"
+DETAIL:  ICU 65.1 provides collator version "153.97" for locale "en_US"; expected version "153.112".
+WARNING:  icu_multilib: collator version mismatch detected for locale "en_US"
+DETAIL:  ICU 65.1 provides collator version "153.97" for locale "en_US"; expected version "153.112".
+WARNING:  database "template1" has a collation version mismatch
+DETAIL:  The database was created using collation version 153.112, but the operating system provides version 153.97.
+HINT:  Rebuild all objects in this database that use the default collation and run ALTER DATABASE template1 REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
+=# ALTER DATABASE template1 REFRESH COLLATION VERSION;
+NOTICE:  changing version from 153.112 to 153.97
+ALTER DATABASE
+=# SELECT pg_import_system_collations('pg_catalog');
+ pg_import_system_collations
+-----------------------------
+                         792
+(1 row)
+
+=# \q
+$ psql postgres
+WARNING:  icu_multilib: collator version mismatch detected for locale "en_US"
+DETAIL:  ICU 65.1 provides collator version "153.97" for locale "en_US"; expected version "153.112".
+WARNING:  icu_multilib: collator version mismatch detected for locale "en_US"
+DETAIL:  ICU 65.1 provides collator version "153.97" for locale "en_US"; expected version "153.112".
+WARNING:  database "postgres" has a collation version mismatch
+DETAIL:  The database was created using collation version 153.112, but the operating system provides version 153.97.
+HINT:  Rebuild all objects in this database that use the default collation and run ALTER DATABASE postgres REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
+=# ALTER DATABASE postgres REFRESH COLLATION VERSION;
+NOTICE:  changing version from 153.112 to 153.97
+ALTER DATABASE
+=# SELECT pg_import_system_collations('pg_catalog');
+ pg_import_system_collations
+-----------------------------
+                         792
+(1 row)
+
+=# \q
+$
+</programlisting>
+
+  <para>
+   The result will be a system based on ICU version <literal>65.1</literal>, and
+   <productname>PostgreSQL</productname> will use that version of ICU for all
+   collations that use the ICU provider.
+  </para>
+ </sect2>
+
+ <sect2 id="icu-multi-version">
+  <title>Multiple ICU Version Support</title>
+  <para>
+   Support for multiple ICU library versions in the same database allows the
+   system to adapt to changes in natural language over time. As these changes
+   are introduced in new versions of ICU, those new versions can be brought
+   into <productname>PostgreSQL</productname> incrementally without
+   interfering with existing data (or structures like indexes).
+  </para>
+  <sect3>
+   <title>Search by Collator Version</title>
+
+   <para>
+    Collators provided by ICU also have a version which is distinct from the
+    version of ICU. For instance, in ICU version <literal>70.1</literal>, the
+    collator for the <literal>en_US</literal> locale has version
+    <literal>153.112</literal>. A collator for a given locale may or may not
+    be assigned a new version when the ICU library version is updated.
+   </para>
+
+   <para>
+    When a collation is created, <productname>PostgreSQL</productname> obtains
+    the version of the collator from the ICU library and records it in the
+    catalog. This recorded version can be updated using <literal>ALTER
+    COLLATION ... REFRESH VERSION</literal>; or <literal>ALTER DATABASE
+    ... REFRESH COLLATION VERSION</literal> if it's the database's default
+    collation. These <literal>ALTER</literal> commands obtain the new collator
+    version from the ICU library identified by
+    <varname>icu_multilib.default_icu_version</varname> if set; otherwise they
+    obtain the new collator version from the built-in ICU library.
+   </para>
+
+   <para>
+    Set <varname>icu_multilib.search_by_collator_version</varname> to
+    <literal>true</literal> to enable <filename>icu_multilib</filename> to
+    search (in descending order of ICU major version) among the
+    available libraries to find the first one that provides a collator with a
+    version that matches the one recorded in the catalog.
+   </para>
+
+   <para>
+    For example, ICU versions <literal>62.2</literal> and
+    <literal>63.2</literal> both provide collator version
+    <literal>153.88</literal> for the locale <literal>en_US</literal>; while
+    ICU version <literal>64.2</literal> provides collator version
+    <literal>153.97</literal> for the same locale. Searching for an ICU
+    library for a collation with a recorded version of
+    <literal>153.88</literal> will find the ICU library with version
+    <literal>63.2</literal>, because that's the ICU library with the highest
+    major version (<literal>63</literal>) that provides collator version
+    <literal>153.88</literal>. The function
+    <function>icu_multilib.collator_version_search</function> simulates this
+    search for the given locale and version:
+   </para>
+<programlisting>
+SELECT * FROM icu_multilib.collator_version_search('en_US', '153.88');
+ icu_version | uca_version | collator_version
+-------------+-------------+------------------
+ 63.2        | 11.0        | 153.88
+(1 row)
+
+</programlisting>
+
+   <note>
+    <para>
+     Searching by collator version relies on the assumption that collators
+     with the same version behave identically regardless of which ICU library
+     provides them. The administrator must weigh the risks of this assumption
+     against the convenience when managing multiple major versions of ICU.
+    </para>
+   </note>
+  </sect3>
+ </sect2>
+
+ <sect2 id="icu-multilib-configuration-parameters">
+  <title>Configuration Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.library_path</varname> (<type>string</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.library_path</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      The filesystem path in which to search for ICU libraries. Must be set
+      for <filename>icu_multilib</filename> to be useful. Multiple major
+      versions of ICU may be present; but for each major version, only a
+      single minor version can be present.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.version_mismatch_log_level</varname> (<type>enum</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.version_mismatch_log_level</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      A version mismatch happens when the version of a collator provided by
+      the ICU library is different from the collator version recorded when
+      the collation was created (or last refreshed).
+     </para>
+     <para>
+      When a version mismatch is detected, a log message is emitted at this
+      level. <literal>WARNING</literal> is the default. If the level is set to
+      <literal>ERROR</literal>, it may interfere with normal operation.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.library_search_log_level</varname> (<type>enum</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.library_search_log_level</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      <filename>icu_multilib</filename> emits messages at this level while
+      determining which ICU library to choose. The default is
+      <literal>DEBUG1</literal>. This setting is useful to diagnose problems
+      when a suitable ICU library is not found.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.default_icu_version</varname> (<type>string</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.default_icu_version</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Identifies the version of the ICU library to choose; or, if <varname>icu_multilib.search_by_collator_version</varname> is <literal>true</literal>, the ICU library to choose if no identical match to the collator version is found. The default is the empty string, meaning to use the built-in ICU library.
+     </para>
+     <para>
+      Valid values are either of the form <literal>major.minor</literal> such
+      as <literal>70.1</literal>; or a major version number alone, such as
+      <literal>70</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.include_builtin</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.include_builtin</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Determines whether the built-in version of ICU is included when
+      searching for a suitable ICU library. The default is
+      <literal>true</literal>.
+     </para>
+     <para>
+      Even if set to <literal>false</literal>, the built-in ICU library may be
+      used if no other suitable ICU library is identified and
+      <varname>icu_multilib.default_icu_version</varname> is set to the empty
+      string.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.search_by_collator_version</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.search_by_collator_version</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Causes <filename>icu_multilib</filename> to identify the ICU library
+      with the highest major version that offers the required collator with an
+      exactly-matching collator version. The default is <literal>false</literal>.
+     </para>
+     <para>
+      When set to <literal>true</literal>, <filename>icu_multilib</filename>
+      relies on equal collator versions to produce identical collation order.
+     </para>
+     <para>
+      If no ICU library is found with an exactly-matching collator version,
+      <filename>icu_multilib</filename> will fall back to the
+      <varname>icu_multilib.default_icu_version</varname>; or if not set, to
+      the built-in ICU library.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Functions</title>
+
+  <table id="icu-multilib-functions">
+   <title><filename>icu_multilib</filename> Functions</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       Function
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <function>icu_multilib.library_versions</function> ()
+       <returnvalue>setof record</returnvalue>
+       ( <parameter>icu_version</parameter> <type>text</type>,
+       <parameter>unicode_version</parameter> <type>text</type>,
+       <parameter>cldr_version</parameter> <type>text</type>,
+       <parameter>libicui18n_name</parameter> <type>text</type>,
+       <parameter>libicuuc_name</parameter> <type>text</type>)
+      </para>
+      <para>
+       Returns details for each available ICU library found in
+       <varname>icu_multilib.library_path</varname>. Also includes the
+       built-in ICU library if <varname>icu_multilib.include_builtin</varname>
+       is <literal>true</literal>.
+      </para></entry>
+     </row>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <function>icu_multilib.library_collators</function> ( <parameter>major_version</parameter> <type>text</type> <literal>DEFAULT</literal> <literal>null</literal> )
+       <returnvalue>setof record</returnvalue>
+       ( <parameter>locale</parameter> <type>text</type>,
+       <parameter>icu_version</parameter> <type>text</type>,
+       <parameter>uca_version</parameter> <type>text</type>,
+       <parameter>collator_version</parameter> <type>text</type> )
+      </para>
+      <para>
+       Returns details for all available collators provided by the ICU library
+       with the given <parameter>major_version</parameter>; or by the built-in
+       ICU library if <parameter>major_version</parameter> is
+       <literal>null</literal>.
+      </para></entry>
+     </row>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <function>icu_multilib.collator_version_search</function> ( <parameter>locale</parameter> <type>text</type>, <parameter>requested_version</parameter> <type>text</type> <literal>DEFAULT</literal> <literal>null</literal>, <parameter>log_ok</parameter> <type>boolean</type> <literal>DEFAULT</literal> <literal>false</literal> )
+       <returnvalue>record</returnvalue>
+       ( <parameter>icu_version</parameter> <type>text</type>,
+       <parameter>uca_version</parameter> <type>text</type>,
+       <parameter>collator_version</parameter> <type>text</type> )
+      </para>
+      <para>
+       Performs a search for the appropriate ICU library given the locale name
+       and requested collator version, and returns details about the ICU
+       library.
+      </para>
+      <para>
+       If <parameter>requested_version</parameter> is <literal>null</literal>,
+       it will return the ICU library identified by
+       <varname>icu_multilib.default_icu_version</varname> if that is set;
+       otherwise the built-in ICU library.
+      </para>
+      <para>
+       If <parameter>log_ok</parameter> is <literal>true</literal>, the search
+       may emit log messages at the level
+       <varname>icu_multilib.library_search_log_level</varname>, which may be
+       useful for diagnosing misconfiguration.
+      </para></entry>
+     </row>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <function>icu_multilib.collator_versions</function> ( <parameter>locale</parameter> <type>text</type> )
+       <returnvalue>setof record</returnvalue>
+       ( <parameter>icu_version</parameter> <type>text</type>,
+       <parameter>uca_version</parameter> <type>text</type>,
+       <parameter>collator_version</parameter> <type>text</type> )
+      </para>
+      <para>
+       Returns details about available collators for the given
+       <parameter>locale</parameter> from all available ICU libraries.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+ </sect2>
+
+ <sect2 id="icu-multilib-examples">
+  <title>Examples</title>
+
+<screen>
+SELECT icu_version, unicode_version, cldr_version FROM icu_multilib.library_versions() LIMIT 3;
+ icu_version | unicode_version | cldr_version
+-------------+-----------------+--------------
+ 70.1        | 14.0            | 40.0
+ 69.1        | 13.0            | 39.0
+ 68.2        | 13.0            | 38.1
+(3 rows)
+</screen>
+
+<screen>
+SELECT * FROM icu_multilib.collator_versions('en_US') WHERE icu_version BETWEEN '60.0' and '65.0';
+ icu_version | uca_version | collator_version
+-------------+-------------+------------------
+ 64.2        | 12.1        | 153.97
+ 63.2        | 11.0        | 153.88
+ 62.2        | 11.0        | 153.88
+ 61.2        | 10.0        | 153.80
+ 60.3        | 10.0        | 153.80
+(5 rows)
+</screen>
+
+<screen>
+SELECT * FROM icu_multilib.collator_version_search('en_US', '153.97');
+ icu_version | uca_version | collator_version
+-------------+-------------+------------------
+ 65.1        | 12.1        | 153.97
+(1 row)
+</screen>
+
+ </sect2>
+
+ <sect2 id="icu-multilib-author">
+  <title>Authors</title>
+
+  <para>
+   Jeff Davis <email>jdavis@postgresql.org</email>, Thomas Munro <email>tmunro@postgresql.org</email>
+  </para>
+ </sect2>
+
+</sect1>
-- 
2.34.1

#139

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#138)

5 attachment(s)

Re: Collation version tracking for macOS

On Thu, 2023-01-19 at 00:11 -0800, Jeff Davis wrote:

Attached are a new set of patches, including a major enhancement: the
icu_multilib contrib module.

Attached rebased v8.

[ It looks like my email client truncated the last email somehow, in
case someone was wondering why it just stopped. ]

The big change is the introduction of the icu_multilib contrib module
which provides a lot of the functionality requested in this thread:

* ICU version stability, which allows you to "lock down" ICU to a
specific major and minor version (or major version only)
* multi-lib ICU, which (if a GUC is set) will enable the "search by
collversion" behavior. Some doubts were raised about the wisdom of this
approach, but it's the only multi-lib solution we have without doing
some significant catalog work.
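
To make the two behaviors above concrete, here is a minimal standalone
sketch of the selection order as the icu_multilib docs describe it: an
exact collator-version match first (when searching is enabled), then
the configured default version, then the built-in library. It is not
code from the patch. The type mock_icu_lib, the table libs, and the
three functions are hypothetical names, and the table stands in for
libraries that the real module would load with dlopen(). The
collator versions are the en_US values quoted in the docs.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one loaded ICU library and the collator
 * version it reports for a single locale (en_US values from the docs). */
typedef struct
{
	int			major;
	int			minor;
	const char *collator_version;
} mock_icu_lib;

/* Assumed to be kept sorted by descending major version. */
static const mock_icu_lib libs[] = {
	{64, 2, "153.97"},
	{63, 2, "153.88"},
	{62, 2, "153.88"},
	{61, 2, "153.80"},
	{60, 3, "153.80"},
};
static const int nlibs = sizeof(libs) / sizeof(libs[0]);

/* Highest-major library whose collator version matches exactly, or NULL. */
static const mock_icu_lib *
search_by_collator_version(const char *requested)
{
	for (int i = 0; i < nlibs; i++)
		if (strcmp(libs[i].collator_version, requested) == 0)
			return &libs[i];
	return NULL;
}

/*
 * Resolve a "major" or "major.minor" specification such as "63" or "63.2".
 * This sketch does not reject trailing garbage after the numbers.
 */
static const mock_icu_lib *
find_by_version_spec(const char *spec)
{
	int			major = 0;
	int			minor = -1;		/* -1 means "any minor version" */

	if (sscanf(spec, "%d.%d", &major, &minor) < 1)
		return NULL;
	for (int i = 0; i < nlibs; i++)
		if (libs[i].major == major &&
			(minor < 0 || libs[i].minor == minor))
			return &libs[i];
	return NULL;
}

/* Returns NULL when the caller should use the built-in ICU library. */
static const mock_icu_lib *
choose_library(const char *recorded_version, const char *default_spec,
			   bool search_enabled)
{
	const mock_icu_lib *lib = NULL;

	if (search_enabled && recorded_version != NULL)
		lib = search_by_collator_version(recorded_version);
	if (lib == NULL && default_spec != NULL && default_spec[0] != '\0')
		lib = find_by_version_spec(default_spec);
	return lib;
}
```

With searching enabled, choose_library("153.88", "64", true) picks the
63.2 entry, matching the collator_version_search example in the docs.
With searching disabled, the same call resolves to the 64.x entry,
which is the "lock down" case.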

I rendered the HTML docs for icu_multilib and attached to this email to
make it easier to view.

icu_multilib assumes that the various ICU library versions are already
available in a single location, most likely installed with a package
manager. That location can be the same as the built-in ICU, or a
different location. Ideally, packagers would start to offer a few
"stable" versions of ICU that would be available for a long time, but
it will take a while for that to happen. So for now, it's up to the
user to figure out how to get the right versions of ICU on their system
and keep them there.
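
As a rough illustration of what "available in a single location" means
in practice, the sketch below probes one directory for a range of major
versions and reports those for which both libraries are present. It is
not code from the patch. The function name probe_icu_versions is
hypothetical, and the file naming "libicui18n.so.<major>" and
"libicuuc.so.<major>" is an assumption based on typical Linux packaging.
The real module would go on to dlopen() each pair rather than just
checking readability.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Fill "found" with each major version in [min_major, max_major] for which
 * both library files are readable in "dir", highest version first.
 * Returns the number of versions found.
 *
 * The "libicui18n.so.<major>" / "libicuuc.so.<major>" naming is an
 * assumption; other platforms name their ICU libraries differently.
 */
static int
probe_icu_versions(const char *dir, int min_major, int max_major,
				   int *found, int found_size)
{
	int			n = 0;

	for (int major = max_major; major >= min_major && n < found_size; major--)
	{
		char		i18n[1024];
		char		uc[1024];
		int			r1;
		int			r2;

		r1 = snprintf(i18n, sizeof(i18n), "%s/libicui18n.so.%d", dir, major);
		r2 = snprintf(uc, sizeof(uc), "%s/libicuuc.so.%d", dir, major);
		if (r1 < 0 || r2 < 0 ||
			(size_t) r1 >= sizeof(i18n) || (size_t) r2 >= sizeof(uc))
			continue;			/* path too long, skip this version */

		/* a version counts only if the companion library is there too */
		if (access(i18n, R_OK) == 0 && access(uc, R_OK) == 0)
			found[n++] = major;
	}
	return n;
}
```

Passing the build-time major version as max_major would mirror the
documented rule that libraries newer than the one PostgreSQL was built
with are ignored.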

Automated tests of icu_multilib are a problem unless the one running
the tests is willing to compile the right versions of ICU (like I did).
But I at least have automated tests for the hooks by using the test
module test_collator_lib_hooks.

The v7 patches in this thread are dependent on the pure refactoring
patches in this CF entry:

https://commitfest.postgresql.org/41/3935/

/messages/by-id/052a5ed874d110be2f3ae28752e363306b10966d.camel@j-davis.com

The requested functionality _not_ offered by icu_multilib is tying a
specific collation to a specific ICU version. A few variants were
proposed, the latest is to tie a collation to the library file itself
through the provider. That needs to be done with proper catalog support
in core. But I believe the work I've done here has made a lot of
progress in that direction, and also shows the versatility of the new
hook to solve at least some problems.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

icu-multilib.html (application/xhtml+xml)

v8-0004-icu_multilib-contrib-extension-for-managing-ICU-l.patch (text/x-patch; charset=UTF-8)

From 1f305419c350a6f0db8245cdc4630c4942a2e1a9 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Sat, 14 Jan 2023 12:17:05 -0800
Subject: [PATCH v8 4/4] icu_multilib: contrib extension for managing ICU
 libraries.

The icu_multilib module provides control over the version (or
versions) of the ICU provider library used by PostgreSQL, which can be
different from the version of ICU with which it was built.

Control over the specific ICU library is important, because any
differences in collation may result in corrupt indexes, and possibly
other problems.

The primary use case is the ability to stabilize the ICU version
("lock" it to a specific version), even if the built-in ICU library
changes. It can be "locked" to a specific major and minor version (the
safest option); or to a major version only, enabling non-disruptive
minor version updates of ICU (albeit with some risk of subtle
differences in collation behavior).

If enabled, icu_multilib also provides support for multiple ICU
libraries by using "search by collator version" logic. This logic
assumes that identical collator versions (as provided by ICU) mean
identical behavior, which also carries some risk.

Author: Jeff Davis, Thomas Munro
---
 contrib/Makefile                           |   6 +
 contrib/icu_multilib/Makefile              |  21 +
 contrib/icu_multilib/icu_multilib--1.0.sql |  43 +
 contrib/icu_multilib/icu_multilib.c        | 880 +++++++++++++++++++++
 contrib/icu_multilib/icu_multilib.control  |   7 +
 contrib/icu_multilib/meson.build           |  24 +
 contrib/meson.build                        |   3 +
 doc/src/sgml/contrib.sgml                  |   1 +
 doc/src/sgml/filelist.sgml                 |   1 +
 doc/src/sgml/icu-multilib.sgml             | 496 ++++++++++++
 10 files changed, 1482 insertions(+)
 create mode 100644 contrib/icu_multilib/Makefile
 create mode 100644 contrib/icu_multilib/icu_multilib--1.0.sql
 create mode 100644 contrib/icu_multilib/icu_multilib.c
 create mode 100644 contrib/icu_multilib/icu_multilib.control
 create mode 100644 contrib/icu_multilib/meson.build
 create mode 100644 doc/src/sgml/icu-multilib.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index bbf220407b..06447d9346 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -65,6 +65,12 @@ else
 ALWAYS_SUBDIRS += uuid-ossp
 endif
 
+ifeq ($(with_icu),yes)
+SUBDIRS += icu_multilib
+else
+ALWAYS_SUBDIRS += icu_multilib
+endif
+
 ifeq ($(with_libxml),yes)
 SUBDIRS += xml2
 else
diff --git a/contrib/icu_multilib/Makefile b/contrib/icu_multilib/Makefile
new file mode 100644
index 0000000000..4823007abe
--- /dev/null
+++ b/contrib/icu_multilib/Makefile
@@ -0,0 +1,21 @@
+# contrib/icu_multilib/Makefile
+
+MODULE_big = icu_multilib
+OBJS = \
+	$(WIN32RES) \
+	icu_multilib.o
+
+EXTENSION = icu_multilib
+DATA = icu_multilib--1.0.sql
+PGFILEDESC = "icu_multilib -- support multiple ICU library versions"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/icu_multilib
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/icu_multilib/icu_multilib--1.0.sql b/contrib/icu_multilib/icu_multilib--1.0.sql
new file mode 100644
index 0000000000..6751242bb4
--- /dev/null
+++ b/contrib/icu_multilib/icu_multilib--1.0.sql
@@ -0,0 +1,43 @@
+/* contrib/icu_multilib/icu_multilib--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION icu_multilib" to load this file. \quit
+
+CREATE FUNCTION library_versions(
+    icu_version OUT TEXT,
+    unicode_version OUT TEXT,
+    cldr_version OUT TEXT,
+    libicui18n_name OUT TEXT,
+    libicuuc_name OUT TEXT
+  )
+  RETURNS SETOF RECORD LANGUAGE C
+  AS 'MODULE_PATHNAME', 'library_versions';
+
+CREATE FUNCTION library_collators(
+    major_version IN TEXT DEFAULT NULL,
+    locale OUT TEXT,
+    icu_version OUT TEXT,
+    uca_version OUT TEXT,
+    collator_version OUT TEXT
+  )
+  RETURNS SETOF RECORD LANGUAGE C
+  AS 'MODULE_PATHNAME', 'library_collators';
+
+CREATE FUNCTION collator_version_search (
+    locale IN TEXT,
+    requested_version IN TEXT DEFAULT NULL,
+    log_ok IN BOOLEAN DEFAULT FALSE,
+    icu_version OUT TEXT,
+    uca_version OUT TEXT,
+    collator_version OUT TEXT)
+  RETURNS RECORD LANGUAGE C
+  AS 'MODULE_PATHNAME', 'collator_version_search';
+
+CREATE FUNCTION collator_versions (
+    locale IN TEXT,
+    icu_version OUT TEXT,
+    uca_version OUT TEXT,
+    collator_version OUT TEXT
+  )
+  RETURNS SETOF RECORD LANGUAGE C
+  AS 'MODULE_PATHNAME', 'collator_versions';
diff --git a/contrib/icu_multilib/icu_multilib.c b/contrib/icu_multilib/icu_multilib.c
new file mode 100644
index 0000000000..2fde2a85d5
--- /dev/null
+++ b/contrib/icu_multilib/icu_multilib.c
@@ -0,0 +1,880 @@
+
+#include "postgres.h"
+
+#include <dlfcn.h>
+#include <limits.h>
+
+#include "common/string.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "nodes/execnodes.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_locale_internal.h"
+
+#ifndef USE_ICU
+#error "ICU support is required to build icu_multilib"
+#endif
+
+/*
+ * We don't want to call into dlopen'd ICU libraries that are newer than the
+ * one we were compiled and linked against, just in case there is an
+ * incompatible API change.
+ */
+#define PG_MAX_ICU_MAJOR U_ICU_VERSION_MAJOR_NUM
+
+/*
+ * The oldest ICU release we're likely to encounter, and that has all the
+ * functions required.
+ */
+#define PG_MIN_ICU_MAJOR 50
+
+/*
+ * Enough to hold entries for minimum to maximum supported ICU versions, and
+ * also the builtin ICU (if icu_multilib.include_builtin is set).
+ */
+#define ICU_LIB_TABLESIZE (PG_MAX_ICU_MAJOR - PG_MIN_ICU_MAJOR + 2)
+
+typedef struct icu_version
+{
+	int major;
+	int minor;
+} icu_version;
+
+static const struct config_enum_entry log_level_options[] = {
+	{"debug5", DEBUG5, false},
+	{"debug4", DEBUG4, false},
+	{"debug3", DEBUG3, false},
+	{"debug2", DEBUG2, false},
+	{"debug1", DEBUG1, false},
+	{"debug", DEBUG2, true},
+	{"log", LOG, false},
+	{"info", INFO, true},
+	{"notice", NOTICE, false},
+	{"warning", WARNING, false},
+	{"error", ERROR, false},
+	{NULL, 0, false}
+};
+
+PG_MODULE_MAGIC;
+
+static get_icu_library_hook_type prev_icu_library_hook = NULL;
+
+static pg_icu_library *icu_library_table[ICU_LIB_TABLESIZE] = {};
+
+static char *icu_library_path = "";
+static char *default_icu_version = "";
+static icu_version default_version = { .major = -1, .minor = -1 };
+static bool search_by_collator_version = false;
+static bool include_builtin = true;
+static int version_mismatch_log_level = WARNING;
+static int library_search_log_level = DEBUG1;
+
+static void initialize_gucs_1(void);
+static void initialize_gucs_2(void);
+static pg_icu_library *icu_multilib_hook(Oid collid, const char *locale,
+										 const char *version);
+static pg_icu_library *icu_multilib_search(Oid collid, const char *locale,
+										   const char *version,
+										   bool logOk);
+static void load_all_libraries(void);
+static pg_icu_library *load_icu_library(int major);
+static pg_icu_library *get_icu_by_major_version(int major);
+static void get_library_version(const pg_icu_library *lib, int *major,
+								int *minor);
+
+static bool check_string_version(char **newval, void **extra, GucSource source);
+static void assign_default_version(const char *newval, void *extra);
+
+void
+_PG_init()
+{
+	/*TODO: error messages */
+	if (!process_shared_preload_libraries_in_progress)
+		elog(ERROR, "icu_multilib must be loaded by shared_preload_libraries");
+
+	prev_icu_library_hook = get_icu_library_hook;
+	get_icu_library_hook = icu_multilib_hook;
+
+	initialize_gucs_1();
+	load_all_libraries();
+	initialize_gucs_2();
+}
+
+/* initialize GUCs before loading ICU libraries */
+static void
+initialize_gucs_1()
+{
+	/*
+	 * The library search path is initialized once per server start and not
+	 * changeable -- this keeps the memory usage in TopMemoryContext bounded.
+	 */
+	DefineCustomStringVariable("icu_multilib.library_path",
+							   "Filesystem path where ICU libraries are installed.",
+							   NULL,
+							   &icu_library_path,
+							   "",
+							   PGC_POSTMASTER,
+							   0, NULL, NULL, NULL);
+	DefineCustomEnumVariable("icu_multilib.version_mismatch_log_level",
+							 "Level of log message when a collator version mismatch is detected.",
+							 NULL,
+							 &version_mismatch_log_level,
+							 WARNING,
+							 log_level_options,
+							 PGC_SUSET,
+							 0, NULL, NULL, NULL);
+	DefineCustomEnumVariable("icu_multilib.library_search_log_level",
+							 "Level of log messages related to searching for an ICU library.",
+							 NULL,
+							 &library_search_log_level,
+							 DEBUG1,
+							 log_level_options,
+							 PGC_SUSET,
+							 0, NULL, NULL, NULL);
+}
+
+/* initialize GUCs after loading ICU libraries */
+static void
+initialize_gucs_2()
+{
+	DefineCustomStringVariable("icu_multilib.default_icu_version",
+							   "The version of the default ICU library.",
+							   "Can be specified with major and minor versions, or major version only.",
+							   &default_icu_version,
+							   "",
+							   PGC_SUSET,
+							   0, check_string_version, assign_default_version, NULL);
+	DefineCustomBoolVariable("icu_multilib.include_builtin",
+							 "Include built-in ICU library when listing or searching libraries.",
+							 NULL,
+							 &include_builtin,
+							 true,
+							 PGC_SUSET,
+							 0, NULL, NULL, NULL);
+	DefineCustomBoolVariable("icu_multilib.search_by_collator_version",
+							 "Enable searching for the ICU library based on the collator version.",
+							 NULL,
+							 &search_by_collator_version,
+							 false,
+							 PGC_SUSET,
+							 0, NULL, NULL, NULL);
+}
+
+static bool
+parse_icu_version(const char *version, icu_version *out_version)
+{
+	size_t	input_len = strlen(version);
+	int		parse_len;
+	int		parse_major;
+	int		parse_minor;
+
+	if (sscanf(version, "%d.%d%n",
+			   &parse_major, &parse_minor, &parse_len) == 2 &&
+		input_len == parse_len)
+	{
+		out_version->major = parse_major;
+		out_version->minor = parse_minor;
+		return true;
+	}
+	else if (sscanf(version, "%d%n", &parse_major, &parse_len) == 1 &&
+			 input_len == parse_len)
+	{
+		out_version->major = parse_major;
+		out_version->minor = -1;
+		return true;
+	}
+
+	return false;
+}
+
+static bool
+check_string_version(char **newval, void **extra, GucSource source)
+{
+	pg_icu_library	*lib;
+	icu_version		*myextra;
+	icu_version		 parsed_version;
+
+	if (*newval == NULL || *newval[0] == '\0')
+		return true;
+
+	if (!parse_icu_version(*newval, &parsed_version))
+	{
+		GUC_check_errmsg("error parsing ICU version \"%s\"", *newval);
+		GUC_check_errhint("The ICU version should be either in the form \"major.minor\", "
+						  "or just be a plain major version number.");
+		return false;
+	}
+
+	if (parsed_version.major < PG_MIN_ICU_MAJOR ||
+		parsed_version.major > PG_MAX_ICU_MAJOR)
+	{
+		GUC_check_errmsg("major version %d is out of range",
+						 parsed_version.major);
+		GUC_check_errhint("ICU major version must be between %d and %d.",
+						  PG_MIN_ICU_MAJOR, PG_MAX_ICU_MAJOR);
+		return false;
+	}
+
+	lib = get_icu_by_major_version(parsed_version.major);
+	if (!lib)
+	{
+		GUC_check_errmsg("ICU library with major version %d not found",
+						 parsed_version.major);
+		return false;
+	}
+
+	if (parsed_version.minor != -1 && lib->minor_version != parsed_version.minor)
+	{
+		GUC_check_errmsg("ICU library with major version %d has minor version %d; "
+						 "expected minor version %d",
+						 lib->major_version, lib->minor_version,
+						 parsed_version.minor);
+		return false;
+	}
+
+	myextra = guc_malloc(ERROR, sizeof(icu_version));
+	*myextra = parsed_version;
+	*extra = myextra;
+
+	return true;
+}
+
+static void
+assign_default_version(const char *newval, void *extra)
+{
+	icu_version *myextra = (icu_version *) extra;
+
+	if (myextra == NULL)
+		return;
+
+	default_version = *myextra;
+}
+
+static void
+load_all_libraries()
+{
+	pg_icu_library *builtin = get_builtin_icu_library();
+
+	icu_library_table[ICU_LIB_TABLESIZE - 1] = builtin;
+
+	ereport(LOG, (errmsg("icu_multilib: retrieved built-in ICU version %d.%d",
+						 builtin->major_version, builtin->minor_version)));
+
+	for (int major = PG_MAX_ICU_MAJOR; major >= PG_MIN_ICU_MAJOR; major--)
+		icu_library_table[major - PG_MIN_ICU_MAJOR] = load_icu_library(major);
+}
+
+static pg_icu_library *
+get_icu_by_major_version(int major)
+{
+	pg_icu_library *lib;
+	if (major < PG_MIN_ICU_MAJOR || major > PG_MAX_ICU_MAJOR)
+		ereport(ERROR, (errmsg("major version %d is out of range", major)));
+
+	lib = icu_library_table[major - PG_MIN_ICU_MAJOR];
+	return lib;
+}
+
+static void
+get_library_version(const pg_icu_library *lib, int *major, int *minor)
+{
+	UVersionInfo version_info;
+	lib->getICUVersion(version_info);
+	*major = version_info[0];
+	*minor = version_info[1];
+	return;
+}
+
+/*
+ * Fill in out_version (which must have U_MAX_VERSION_STRING_LENGTH bytes
+ * available) with the collator version of the given locale in the given
+ * library. Return false if the collator is not found.
+ */
+static bool
+lib_collator_version(pg_icu_library *lib, const char *locale,
+					 char *out_version)
+{
+	UCollator		*collator;
+	UVersionInfo	 version_info;
+	UErrorCode		 status;
+
+	status = U_ZERO_ERROR;
+	collator = lib->openCollator(locale, &status);
+
+	if (!collator)
+		return false;
+
+	lib->getCollatorVersion(collator, version_info);
+	lib->versionToString(version_info, out_version);
+	lib->closeCollator(collator);
+
+	return true;
+}
+
+/*
+ * Find the right ICU library for the given locale and version. The resulting
+ * library may or may not provide a collator with an exactly-matching
+ * version.
+ *
+ * If search_by_collator_version is set, scan the table (first the built-in
+ * ICU library, then descending order of major versions) to find the first
+ * library that provides a collator of the given locale with a matching
+ * version.
+ *
+ * If no exactly matching version is found, and
+ * icu_multilib.default_icu_version is set, return the default library.
+ *
+ * Otherwise fall back to the built-in library.
+ */
+static pg_icu_library *
+icu_multilib_hook(Oid collid, const char *locale,
+				  const char *requested_version)
+{
+	return icu_multilib_search(collid, locale, requested_version, true);
+}
+
+/*
+ * Search logic shared by icu_multilib_hook() and collator_version_search().
+ */
+static pg_icu_library *
+icu_multilib_search(Oid collid, const char *locale,
+					const char *requested_version, bool logOk)
+{
+	char			 actual_version[U_MAX_VERSION_STRING_LENGTH];
+	pg_icu_library	*found_lib = NULL;
+
+	/*
+	 * If another hook was set first, defer to that unless it returns NULL or
+	 * a library that doesn't contain the given collator at all. This may
+	 * result in a mismatching collator version, but we don't want to
+	 * speculate about what's better or worse in the presence of other hooks.
+	 */
+	if (prev_icu_library_hook)
+	{
+		pg_icu_library *tmp_lib;
+		tmp_lib = prev_icu_library_hook(collid, locale, requested_version);
+		if (tmp_lib && lib_collator_version(tmp_lib, locale, actual_version))
+			found_lib = tmp_lib;
+	}
+
+	if (!found_lib && search_by_collator_version && requested_version != NULL)
+	{
+		/*
+		 * Search the built-in library first, then newest to oldest, for a
+		 * matching version of the collator with the given name.
+		 */
+		for (int i = ICU_LIB_TABLESIZE - 1; i >= 0; i--)
+		{
+			char			 tmp_version[U_MAX_VERSION_STRING_LENGTH];
+			pg_icu_library	*tmp_lib = icu_library_table[i];
+
+			if (tmp_lib == NULL)
+				continue;
+
+			if (!include_builtin && i == ICU_LIB_TABLESIZE - 1)
+				continue;
+
+			if (lib_collator_version(tmp_lib, locale, tmp_version) &&
+				strcmp(requested_version, tmp_version) == 0)
+			{
+				strcpy(actual_version, tmp_version);
+				found_lib = tmp_lib;
+				break;
+			}
+		}
+	}
+
+	if (!found_lib && default_version.major != -1)
+	{
+		pg_icu_library *tmp_lib;
+		tmp_lib = get_icu_by_major_version(default_version.major);
+		if (!tmp_lib)
+			ereport(ERROR,
+					(errmsg("icu_multilib: default major version %d not found",
+							default_version.major)));
+		if (lib_collator_version(tmp_lib, locale, actual_version))
+			found_lib = tmp_lib;
+		else if (logOk)
+			ereport(library_search_log_level,
+					(errmsg("icu_multilib: found default ICU %d.%d, but collator \"%s\" not found",
+							tmp_lib->major_version, tmp_lib->minor_version, locale)));
+	}
+
+	if (!found_lib && include_builtin)
+	{
+		pg_icu_library *tmp_lib;
+		tmp_lib = icu_library_table[ICU_LIB_TABLESIZE - 1];
+		if (lib_collator_version(tmp_lib, locale, actual_version))
+			found_lib = tmp_lib;
+		else if (logOk)
+			ereport(library_search_log_level,
+					(errmsg("icu_multilib: found built-in ICU %d.%d, but collator \"%s\" not found",
+							tmp_lib->major_version, tmp_lib->minor_version, locale)));
+	}
+
+	/* if not found, fall back to built-in or other hook */
+	if (!found_lib)
+		return NULL;
+
+	if (logOk)
+		ereport(library_search_log_level,
+				(errmsg("icu_multilib: found ICU version %d.%d providing collator version \"%s\" for locale \"%s\"",
+						found_lib->major_version, found_lib->minor_version,
+						actual_version, locale)));
+
+	/*
+	 * This is somewhat redundant with a similar warning in pg_locale.c, but
+	 * it provides details about the locale name and ICU version, which is
+	 * helpful when multiple ICU libraries are in use.
+	 */
+	if (requested_version && logOk &&
+		strcmp(requested_version, actual_version) != 0)
+	{
+		ereport(version_mismatch_log_level,
+				(errmsg("icu_multilib: collator version mismatch detected for locale \"%s\"",
+						locale),
+				 errdetail("ICU %d.%d provides collator version \"%s\" for locale \"%s\"; expected version \"%s\".",
+						   found_lib->major_version, found_lib->minor_version,
+						   actual_version, locale, requested_version)));
+	}
+
+	return found_lib;
+}
+
+PG_FUNCTION_INFO_V1(library_versions);
+Datum
+library_versions(PG_FUNCTION_ARGS)
+{
+#define LIBRARY_VERSIONS_COLS 5
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum           values[LIBRARY_VERSIONS_COLS];
+	bool            nulls[LIBRARY_VERSIONS_COLS];
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	for (int i = ICU_LIB_TABLESIZE - 1; i >= 0; i--)
+	{
+		UErrorCode      status;
+		UVersionInfo version_info;
+		char            version_string[U_MAX_VERSION_STRING_LENGTH];
+		pg_icu_library	*lib = icu_library_table[i];
+
+		if (lib == NULL)
+			continue;
+
+		if (!include_builtin && i == ICU_LIB_TABLESIZE - 1)
+			continue;
+
+		lib->getICUVersion(version_info);
+		lib->versionToString(version_info, version_string);
+		values[0] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[0] = false;
+
+		lib->getUnicodeVersion(version_info);
+		lib->versionToString(version_info, version_string);
+		values[1] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[1] = false;
+		status = U_ZERO_ERROR;
+		lib->getCLDRVersion(version_info, &status);
+		if (U_SUCCESS(status))
+		{
+			lib->versionToString(version_info, version_string);
+			values[2] = PointerGetDatum(cstring_to_text(version_string));
+			nulls[2] = false;
+		}
+		else
+		{
+			nulls[2] = true;
+		}
+
+		values[3] = PointerGetDatum(cstring_to_text(lib->libicui18n_name));
+		nulls[3] = false;
+
+		values[4] = PointerGetDatum(cstring_to_text(lib->libicuuc_name));
+		nulls[4] = false;
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
+
+	return (Datum) 0;
+}
+
+#define COLLATOR_DETAIL_COLS 3
+static void
+make_collator_detail_record(pg_icu_library *lib, const char *locale,
+							Datum *values, bool *nulls)
+{
+	UErrorCode		 status;
+	UCollator		*collator;
+	UVersionInfo	 version_info;
+	char			 version_string[U_MAX_VERSION_STRING_LENGTH];
+
+	status = U_ZERO_ERROR;
+	collator = lib->openCollator(locale, &status);
+
+	lib->getICUVersion(version_info);
+	lib->versionToString(version_info, version_string);
+	values[0] = PointerGetDatum(cstring_to_text(version_string));
+	nulls[0] = false;
+
+	if (collator)
+	{
+		lib->getUCAVersion(collator, version_info);
+		lib->versionToString(version_info, version_string);
+		values[1] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[1] = false;
+	}
+	else
+		nulls[1] = true;
+
+	if (collator)
+	{
+		lib->getCollatorVersion(collator, version_info);
+		lib->versionToString(version_info, version_string);
+		values[2] = PointerGetDatum(cstring_to_text(version_string));
+		nulls[2] = false;
+	}
+	else
+		nulls[2] = true;
+
+	if (collator)
+		lib->closeCollator(collator);
+
+	return;
+}
+
+PG_FUNCTION_INFO_V1(library_collators);
+Datum
+library_collators(PG_FUNCTION_ARGS)
+{
+#define LIBRARY_COLLATORS_COLS (COLLATOR_DETAIL_COLS + 1)
+	pg_icu_library	*lib  = NULL;
+	ReturnSetInfo	*rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum			 values[LIBRARY_COLLATORS_COLS];
+	bool			 nulls[LIBRARY_COLLATORS_COLS];
+
+	if (PG_ARGISNULL(0))
+	{
+		lib = icu_library_table[ICU_LIB_TABLESIZE - 1];
+	}
+	else
+	{
+		char *major_str = text_to_cstring(PG_GETARG_TEXT_PP(0));
+		int major = atoi(major_str); /* TODO? */
+
+		if (major < PG_MIN_ICU_MAJOR || major > PG_MAX_ICU_MAJOR)
+			ereport(ERROR, (errmsg("icu_multilib: major version %d is out of range", major)));
+		lib = get_icu_by_major_version(major);
+		if (!lib)
+			ereport(ERROR, (errmsg("icu_multilib: major version %d not found", major)));
+	}
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	for (int i = -1; i < lib->countAvailable(); i++)
+	{
+		const char	*name;
+		const char	*locale;
+		UErrorCode	 status;
+		char		 langtag[ULOC_FULLNAME_CAPACITY];
+
+		if (i == -1)
+			name = "";		/* ICU root locale */
+		else
+			name = lib->getAvailable(i);
+
+		status = U_ZERO_ERROR;
+		lib->toLanguageTag(name, langtag, sizeof(langtag), true, &status);
+
+		if (U_FAILURE(status))
+			continue;
+
+		locale = lib->major_version >= 54 ? langtag : name;
+
+		if (!pg_is_ascii(langtag) || !pg_is_ascii(locale))
+			continue;
+
+		values[0] = PointerGetDatum(cstring_to_text(locale));
+		nulls[0] = false;
+
+		make_collator_detail_record(lib, locale, values + 1, nulls + 1);
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
+
+	return (Datum) 0;
+}
+
+PG_FUNCTION_INFO_V1(collator_version_search);
+Datum
+collator_version_search(PG_FUNCTION_ARGS)
+{
+#define COLLATOR_VERSION_SEARCH_COLS COLLATOR_DETAIL_COLS
+	const char		*locale;
+	const char		*requested_version = NULL;
+	bool			 logOk;
+	pg_icu_library	*lib;
+	int				 major, minor;
+	TupleDesc		 tupdesc;
+	HeapTuple		 tuple;
+	Datum			 values[COLLATOR_VERSION_SEARCH_COLS];
+	bool			 nulls[COLLATOR_VERSION_SEARCH_COLS];
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	if (PG_ARGISNULL(0))
+		ereport(ERROR, (errmsg("locale argument must be non-NULL")));
+	if (PG_ARGISNULL(2))
+		ereport(ERROR, (errmsg("log_ok argument must be non-NULL")));
+
+	locale = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	if (!PG_ARGISNULL(1))
+		requested_version = text_to_cstring(PG_GETARG_TEXT_PP(1));
+	logOk = PG_GETARG_BOOL(2);
+
+	lib = icu_multilib_search(InvalidOid, locale, requested_version, logOk);
+	get_library_version(lib, &major, &minor);
+
+	make_collator_detail_record(lib, locale, values, nulls);
+
+	tuple = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
+}
+
+PG_FUNCTION_INFO_V1(collator_versions);
+Datum
+collator_versions(PG_FUNCTION_ARGS)
+{
+#define COLLATOR_VERSIONS_COLS COLLATOR_DETAIL_COLS
+	const char *locale;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Datum           values[COLLATOR_VERSIONS_COLS];
+	bool            nulls[COLLATOR_VERSIONS_COLS];
+
+	if (PG_ARGISNULL(0))
+		ereport(ERROR, (errmsg("locale argument must be non-NULL")));
+	locale = text_to_cstring(PG_GETARG_TEXT_PP(0));
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	for (int i = ICU_LIB_TABLESIZE - 1; i >= 0; i--)
+	{
+		pg_icu_library	*lib = icu_library_table[i];
+
+		if (lib == NULL)
+			continue;
+
+		if (!include_builtin && i == ICU_LIB_TABLESIZE - 1)
+			continue;
+
+		make_collator_detail_record(lib, locale, values, nulls);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+	}
+
+	return (Datum) 0;
+}
+
+static void
+make_icu_library_names(int major, char **libicui18n, char **libicuuc)
+{
+	char libicui18n_tmp[MAXPGPATH];
+	char libicuuc_tmp[MAXPGPATH];
+
+	/*
+	 * See
+	 * https://unicode-org.github.io/icu/userguide/icu4c/packaging.html#icu-versions
+	 * for conventions on library naming on POSIX and Windows systems.  Apple
+	 * isn't mentioned but varies in the usual way.
+	 *
+	 * Format 1 is expected to be a major version-only symlink pointing to a
+	 * specific minor version (or on Windows it may be the actual library).
+	 * Format 2 is expected to be an actual library.
+	 */
+#ifdef WIN32
+#define ICU_LIBRARY_NAME_FORMAT1 "%s%sicu%s%d" DLSUFFIX
+#elif defined(__darwin__)
+#define ICU_LIBRARY_NAME_FORMAT1 "%s%slibicu%s.%d" DLSUFFIX
+#else
+#define ICU_LIBRARY_NAME_FORMAT1 "%s%slibicu%s" DLSUFFIX ".%d"
+#endif
+
+#ifdef WIN32
+#define PATH_SEPARATOR "\\"
+#define ICU_I18N "in"
+#define ICU_UC "uc"
+#else
+#define PATH_SEPARATOR "/"
+#define ICU_I18N "i18n"
+#define ICU_UC "uc"
+#endif
+
+	snprintf(libicui18n_tmp,
+			 MAXPGPATH,
+			 ICU_LIBRARY_NAME_FORMAT1,
+			 icu_library_path, icu_library_path[0] ? PATH_SEPARATOR : "",
+			 ICU_I18N, major);
+	snprintf(libicuuc_tmp,
+			 MAXPGPATH,
+			 ICU_LIBRARY_NAME_FORMAT1,
+			 icu_library_path, icu_library_path[0] ? PATH_SEPARATOR : "",
+			 ICU_UC, major);
+
+	*libicui18n = pstrdup(libicui18n_tmp);
+	*libicuuc = pstrdup(libicuuc_tmp);
+}
+
+#define MAX_SYMBOL_LEN 128
+static void *
+load_icu_function(void *handle, const char *function, int major)
+{
+	char     function_with_version[MAX_SYMBOL_LEN];
+	int		 ret;
+	void    *result;
+
+	/*
+	 * Try to look up the symbol with the library major version as a suffix.
+	 */
+	ret = snprintf(function_with_version, sizeof(function_with_version),
+				   "%s_%d", function, major);
+	if (ret >= sizeof(function_with_version))
+		elog(ERROR, "icu_multilib: cannot construct symbol name");
+	result = dlsym(handle, function_with_version);
+
+	/*
+	 * Library may have been configured with --disable-renaming, try without
+	 * major version suffix.
+	 */
+	if (result == NULL)
+		result = dlsym(handle, function);
+
+	return result;
+}
+
+#define LOAD_FUNC(DEST, LIBHANDLE, LIBNAME, FNAME)	\
+	do {										\
+		DEST = load_icu_function((LIBHANDLE), (FNAME), major);	\
+		if (DEST == NULL)						\
+		{										\
+			ereport(WARNING,					\
+					(errmsg("icu_multilib: could not find symbol \"%s\" in library \"%s\"",	\
+							(FNAME), (LIBNAME))));	\
+			goto err;							\
+		}										\
+	} while(0)
+
+#define LOAD_FUNC_I18N(DEST, FNAME) \
+	LOAD_FUNC(DEST, libicui18n_handle, libicui18n, FNAME)
+#define LOAD_FUNC_UC(DEST, FNAME) \
+	LOAD_FUNC(DEST, libicuuc_handle, libicuuc, FNAME)
+
+static pg_icu_library *
+load_icu_library(int major)
+{
+	UVersionInfo	 version_info;
+	pg_icu_library	*lib;
+	char			*libicui18n;
+	char			*libicuuc;
+	void			*libicui18n_handle = NULL;
+	void			*libicuuc_handle   = NULL;
+
+	make_icu_library_names(major, &libicui18n, &libicuuc);
+
+	/* Load the i18n library. */
+
+	libicui18n_handle = dlopen(libicui18n, RTLD_NOW | RTLD_LOCAL);
+	if (!libicui18n_handle)
+		return NULL;
+
+	/* Load the common library. */
+	libicuuc_handle = dlopen(libicuuc, RTLD_NOW | RTLD_LOCAL);
+	if (!libicuuc_handle)
+	{
+		elog(WARNING, "found ICU library \"%s\" but not companion library \"%s\"",
+			 libicui18n, libicuuc);
+		dlclose(libicui18n_handle);
+		return NULL;
+	}
+
+	/*
+	 * We only allocate the pg_icu_library object after successfully
+	 * opening the libraries to minimize the work done in the ENOENT case,
+	 * when probing a range of versions.  That means we might need to
+	 * clean up on allocation failure.
+	 */
+	lib = MemoryContextAllocExtended(TopMemoryContext, sizeof(*lib),
+									 MCXT_ALLOC_NO_OOM);
+	lib->libicui18n_name = MemoryContextAllocExtended(
+		TopMemoryContext, strlen(libicui18n) + 1, MCXT_ALLOC_NO_OOM);
+	lib->libicuuc_name = MemoryContextAllocExtended(
+		TopMemoryContext, strlen(libicuuc) + 1, MCXT_ALLOC_NO_OOM);
+
+	if (!lib || !lib->libicui18n_name || !lib->libicuuc_name)
+	{
+		dlclose(libicui18n_handle);
+		dlclose(libicuuc_handle);
+		elog(ERROR, "out of memory");
+	}
+
+	strcpy(lib->libicui18n_name, libicui18n);
+	strcpy(lib->libicuuc_name, libicuuc);
+
+	pfree(libicui18n);
+	pfree(libicuuc);
+
+	/* try to find all the symbols we need from the i18n library */
+	LOAD_FUNC_I18N(lib->getICUVersion, "u_getVersion");
+	LOAD_FUNC_I18N(lib->getUnicodeVersion, "u_getUnicodeVersion");
+	LOAD_FUNC_I18N(lib->getCLDRVersion, "ulocdata_getCLDRVersion");
+	LOAD_FUNC_I18N(lib->openCollator, "ucol_open");
+	LOAD_FUNC_I18N(lib->closeCollator, "ucol_close");
+	LOAD_FUNC_I18N(lib->getCollatorVersion, "ucol_getVersion");
+	LOAD_FUNC_I18N(lib->getUCAVersion, "ucol_getUCAVersion");
+	LOAD_FUNC_I18N(lib->versionToString, "u_versionToString");
+	LOAD_FUNC_I18N(lib->strcoll, "ucol_strcoll");
+	LOAD_FUNC_I18N(lib->strcollUTF8, "ucol_strcollUTF8");
+	LOAD_FUNC_I18N(lib->getSortKey, "ucol_getSortKey");
+	LOAD_FUNC_I18N(lib->nextSortKeyPart, "ucol_nextSortKeyPart");
+	LOAD_FUNC_I18N(lib->setUTF8, "uiter_setUTF8");
+	LOAD_FUNC_I18N(lib->errorName, "u_errorName");
+	LOAD_FUNC_I18N(lib->setAttribute, "ucol_setAttribute");
+
+	/* try to find all the symbols we need from the uc library */
+	LOAD_FUNC_UC(lib->strToUpper, "u_strToUpper");
+	LOAD_FUNC_UC(lib->strToLower, "u_strToLower");
+	LOAD_FUNC_UC(lib->strToTitle, "u_strToTitle");
+	LOAD_FUNC_UC(lib->openConverter, "ucnv_open");
+	LOAD_FUNC_UC(lib->closeConverter, "ucnv_close");
+	LOAD_FUNC_UC(lib->fromUChars, "ucnv_fromUChars");
+	LOAD_FUNC_UC(lib->toUChars, "ucnv_toUChars");
+	LOAD_FUNC_UC(lib->toLanguageTag, "uloc_toLanguageTag");
+	LOAD_FUNC_UC(lib->getDisplayName, "uloc_getDisplayName");
+	LOAD_FUNC_UC(lib->countAvailable, "uloc_countAvailable");
+	LOAD_FUNC_UC(lib->getAvailable, "uloc_getAvailable");
+
+	lib->getICUVersion(version_info);
+	lib->major_version = version_info[0];
+	lib->minor_version = version_info[1];
+
+	if (lib->major_version != major)
+		ereport(version_mismatch_log_level,
+				(errmsg("loaded library for major version %d, but library reports major version %d",
+						major, lib->major_version)));
+
+	ereport(LOG, (errmsg("icu_multilib: loaded ICU version %d.%d",
+						 lib->major_version, lib->minor_version)));
+
+	return lib;
+
+err:
+	dlclose(libicui18n_handle);
+	dlclose(libicuuc_handle);
+	pfree(lib->libicui18n_name);
+	pfree(lib->libicuuc_name);
+	pfree(lib);
+	return NULL;
+}
diff --git a/contrib/icu_multilib/icu_multilib.control b/contrib/icu_multilib/icu_multilib.control
new file mode 100644
index 0000000000..8f2c546781
--- /dev/null
+++ b/contrib/icu_multilib/icu_multilib.control
@@ -0,0 +1,7 @@
+# icu_multilib extension
+comment = 'support multiple ICU library versions'
+default_version = '1.0'
+module_pathname = '$libdir/icu_multilib'
+relocatable = false
+schema = icu_multilib
+trusted = true
diff --git a/contrib/icu_multilib/meson.build b/contrib/icu_multilib/meson.build
new file mode 100644
index 0000000000..f2ea4088e4
--- /dev/null
+++ b/contrib/icu_multilib/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+icu_multilib_sources = files(
+  'icu_multilib.c',
+)
+
+if host_system == 'windows'
+  icu_multilib_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'icu_multilib',
+    '--FILEDESC', 'icu_multilib -- support multiple ICU library versions',])
+endif
+
+icu_multilib = shared_module('icu_multilib',
+  icu_multilib_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += icu_multilib
+
+install_data(
+  'icu_multilib.control',
+  'icu_multilib--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..03f72147cf 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -28,6 +28,9 @@ subdir('fuzzystrmatch')
 subdir('hstore')
 subdir('hstore_plperl')
 subdir('hstore_plpython')
+if icu.found()
+   subdir('icu_multilib')
+endif
 subdir('intagg')
 subdir('intarray')
 subdir('isn')
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 4e7b87a42f..469a1746b9 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -113,6 +113,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
  &file-fdw;
  &fuzzystrmatch;
  &hstore;
+ &icu-multilib;
  &intagg;
  &intarray;
  &isn;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 0d6be9a2fa..091ce0a60b 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -131,6 +131,7 @@
 <!ENTITY file-fdw        SYSTEM "file-fdw.sgml">
 <!ENTITY fuzzystrmatch   SYSTEM "fuzzystrmatch.sgml">
 <!ENTITY hstore          SYSTEM "hstore.sgml">
+<!ENTITY icu-multilib    SYSTEM "icu-multilib.sgml">
 <!ENTITY intagg          SYSTEM "intagg.sgml">
 <!ENTITY intarray        SYSTEM "intarray.sgml">
 <!ENTITY isn             SYSTEM "isn.sgml">
diff --git a/doc/src/sgml/icu-multilib.sgml b/doc/src/sgml/icu-multilib.sgml
new file mode 100644
index 0000000000..838872c880
--- /dev/null
+++ b/doc/src/sgml/icu-multilib.sgml
@@ -0,0 +1,496 @@
+<!-- doc/src/sgml/icu-multilib.sgml -->
+
+<sect1 id="icu-multilib" xreflabel="icu_multilib">
+ <title>icu_multilib</title>
+
+ <indexterm zone="icu-multilib">
+  <primary>icu_multilib</primary>
+ </indexterm>
+
+ <para>
+  The <filename>icu_multilib</filename> module provides control over the
+  version (or versions) of the ICU provider library used by
+  <productname>PostgreSQL</productname>, which can be different from the
+  version of ICU with which it was built.
+ </para>
+
+ <para>
+  Collations are a product of natural language, and natural language evolves
+  over time; but <productname>PostgreSQL</productname> depends on stable
+  ordering for structures such as indexes. Newer versions of ICU update the
+  provided collators to adapt to changes in natural language, so it's
+  important to control when and how those new versions of ICU are used to
+  prevent problems such as index corruption.
+ </para>
+
+ <para>
+  This module assumes that the necessary versions of ICU are already
+  available, such as through the operating system's package manager, and
+  already properly installed in a single location accessible to
+  <productname>PostgreSQL</productname>. The configuration variable
+  <varname>icu_multilib.library_path</varname> should be set to the location
+  where these ICU library versions are installed.
+ </para>
+
+ <para>
+  <filename>icu_multilib</filename> must be loaded via
+  <literal>shared_preload_libraries</literal>.
+  <filename>icu_multilib</filename> ignores any ICU library with a major
+  version greater than that with which <productname>PostgreSQL</productname>
+  was built.
+ </para>
+
+ <sect2 id="icu-multilib-version-stability">
+  <title>ICU Version Stability</title>
+
+  <para>
+   The simplest way to use <filename>icu_multilib</filename> is to force ICU
+   version stability. Version stability means that
+   <productname>PostgreSQL</productname> will use the same version of ICU for
+   all collations, even if the operating system or
+   <productname>PostgreSQL</productname> itself is upgraded or moved. The
+   stable version may or may not be the same as that with which
+   <productname>PostgreSQL</productname> was built.
+  </para>
+
+  <para>
+   Version stability may be based on both the major and minor version of ICU,
+   such as <literal>68.2</literal>; or on the major version alone, such as
+   <literal>68</literal>. Specifying both the major and minor versions more
+   closely controls exactly which ICU library is loaded, but applying minor
+   version upgrades of the ICU library may interfere with normal
+   operation. Specifying the major version alone makes applying ICU minor
+   version updates simpler, but carries a risk of subtle differences in
+   collation order if the library is updated.
+  </para>
+
+  <para>
+   To configure for version stability, first make sure that
+   <varname>icu_multilib.library_path</varname> is set correctly, which you
+   can observe by executing the
+   <literal>icu_multilib.library_versions()</literal> function. Then set
+   <varname>icu_multilib.default_icu_version</varname> to the stable version,
+   and set <varname>icu_multilib.search_by_collator_version</varname> to
+   <literal>false</literal>. These configuration variables should be set in
+   <filename>postgresql.conf</filename> to ensure that the stable version of
+   ICU is used consistently.
+  </para>
+
+  <note>
+   <para>
+    Ensure that the stable version of the ICU library is always available,
+    including on all replicas and through operating system and
+    <productname>PostgreSQL</productname> upgrades.
+   </para>
+   <para>
+    Similarly, keep the settings in <filename>postgresql.conf</filename>, and
+    keep them consistent across replicas and upgrades.
+   </para>
+  </note>
+
+  <para>
+   For best results, stabilize the ICU version immediately after
+   <command>initdb</command> is run, to ensure that initial collations are
+   loaded from the stabilized version of the ICU library. To do this, use the
+   option <xref linkend="app-initdb-option-no-import-collations"/> for
+   <command>initdb</command>, import the collations manually, and then refresh
+   the versions for the default collations. For instance:
+  </para>
+
+<programlisting>
+$ initdb --no-import-collations -D data --locale-provider=icu --icu-locale="en_US" --locale="en_US.UTF-8"
+$ cat - >> data/postgresql.conf
+shared_preload_libraries = 'icu_multilib'
+icu_multilib.library_path = '/path/to/icu/lib'
+icu_multilib.default_icu_version = '65.1'
+icu_multilib.search_by_collator_version = false
+icu_multilib.include_builtin = false
+^D
+$ pg_ctl -D data -l logfile start
+$ psql template1
+WARNING:  icu_multilib: collator version mismatch detected for locale "en_US"
+DETAIL:  ICU 65.1 provides collator version "153.97" for locale "en_US"; expected version "153.112".
+WARNING:  icu_multilib: collator version mismatch detected for locale "en_US"
+DETAIL:  ICU 65.1 provides collator version "153.97" for locale "en_US"; expected version "153.112".
+WARNING:  database "template1" has a collation version mismatch
+DETAIL:  The database was created using collation version 153.112, but the operating system provides version 153.97.
+HINT:  Rebuild all objects in this database that use the default collation and run ALTER DATABASE template1 REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
+=# ALTER DATABASE template1 REFRESH COLLATION VERSION;
+NOTICE:  changing version from 153.112 to 153.97
+ALTER DATABASE
+=# SELECT pg_import_system_collations('pg_catalog');
+ pg_import_system_collations
+-----------------------------
+                         792
+(1 row)
+
+=# \q
+$ psql postgres
+WARNING:  icu_multilib: collator version mismatch detected for locale "en_US"
+DETAIL:  ICU 65.1 provides collator version "153.97" for locale "en_US"; expected version "153.112".
+WARNING:  icu_multilib: collator version mismatch detected for locale "en_US"
+DETAIL:  ICU 65.1 provides collator version "153.97" for locale "en_US"; expected version "153.112".
+WARNING:  database "postgres" has a collation version mismatch
+DETAIL:  The database was created using collation version 153.112, but the operating system provides version 153.97.
+HINT:  Rebuild all objects in this database that use the default collation and run ALTER DATABASE postgres REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
+=# ALTER DATABASE postgres REFRESH COLLATION VERSION;
+NOTICE:  changing version from 153.112 to 153.97
+ALTER DATABASE
+=# SELECT pg_import_system_collations('pg_catalog');
+ pg_import_system_collations
+-----------------------------
+                         792
+(1 row)
+
+=# \q
+$
+</programlisting>
+
+  <para>
+   The result will be a system based on ICU version <literal>65.1</literal>, and
+   <productname>PostgreSQL</productname> will use that version of ICU for all
+   collations that use the ICU provider.
+  </para>
+ </sect2>
+
+ <sect2 id="icu-multi-version">
+  <title>Multiple ICU Version Support</title>
+  <para>
+   Support for multiple ICU library versions in the same database allows the
+   system to adapt to changes in natural language over time. As these changes
+   are introduced in new versions of ICU, those new versions can be brought
+   into <productname>PostgreSQL</productname> incrementally without
+   interfering with existing data (or structures like indexes).
+  </para>
+  <sect3>
+   <title>Search by Collator Version</title>
+
+   <para>
+    Collators provided by ICU also have a version which is distinct from the
+    version of ICU. For instance, in ICU version <literal>70.1</literal>, the
+    collator for the <literal>en_US</literal> locale has version
+    <literal>153.112</literal>. A collator for a given locale may or may not
+    be assigned a new version when the ICU library version is updated.
+   </para>
+
+   <para>
+    When a collation is created, <productname>PostgreSQL</productname> obtains
+    the version of the collator from the ICU library and records it in the
+    catalog. This recorded version can be updated using <literal>ALTER
+    COLLATION ... REFRESH VERSION</literal>; or <literal>ALTER DATABASE
+    ... REFRESH COLLATION VERSION</literal> if it's the database's default
+    collation. These <literal>ALTER</literal> commands obtain the new collator
+    version from the ICU library identified by
+    <varname>icu_multilib.default_icu_version</varname> if set; otherwise they
+    obtain the new collator version from the built-in ICU library.
+   </para>
+
+   <para>
+    Set <varname>icu_multilib.search_by_collator_version</varname> to
+    <literal>true</literal> to enable <filename>icu_multilib</filename> to
+    search (in descending order of ICU major version) among the
+    available libraries to find the first one that provides a collator with a
+    version that matches the one recorded in the catalog.
+   </para>
+
+   <para>
+    For example, ICU versions <literal>62.2</literal> and
+    <literal>63.2</literal> both provide collator version
+    <literal>153.88</literal> for the locale <literal>en_US</literal>; while
+    ICU version <literal>64.2</literal> provides collator version
+    <literal>153.97</literal> for the same locale. Searching for an ICU
+    library for a collation with a recorded version of
+    <literal>153.88</literal> will find the ICU library with version
+    <literal>63.2</literal>, because that's the ICU library with the highest
+    major version (<literal>63</literal>) that provides collator version
+    <literal>153.88</literal>. The function
+    <function>icu_multilib.collator_version_search</function> simulates this
+    search for the given locale and version:
+   </para>
+<programlisting>
+SELECT * FROM icu_multilib.collator_version_search('en_US', '153.88');
+ icu_version | uca_version | collator_version
+-------------+-------------+------------------
+ 63.2        | 11.0        | 153.88
+(1 row)
+
+</programlisting>
+
+   <note>
+    <para>
+     Searching by collator version relies on the assumption that collators
+     with the same version behave identically regardless of which ICU library
+     provides them. The administrator must weigh the risks of this assumption
+     against the convenience when managing multiple major versions of ICU.
+    </para>
+   </note>
+  </sect3>
+ </sect2>
+
+ <sect2 id="icu-multilib-configuration-parameters">
+  <title>Configuration Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.library_path</varname> (<type>string</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.library_path</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      The filesystem path in which to search for ICU libraries. Must be set
+      for <filename>icu_multilib</filename> to be useful. Multiple major
+      versions of ICU may be present; but for each major version, only a
+      single minor version can be present.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.version_mismatch_log_level</varname> (<type>string</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.version_mismatch_log_level</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      A version mismatch happens when the version of a collator provided by
+      the ICU library is different from the collator version recorded when
+      the collation was created (or last refreshed).
+     </para>
+     <para>
+      When a version mismatch is detected, a log message is emitted at this
+      level. <literal>WARNING</literal> is the default. If the level is set to
+      <literal>ERROR</literal>, it may interfere with normal operation.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.library_search_log_level</varname> (<type>string</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.library_search_log_level</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      <filename>icu_multilib</filename> emits messages at this level while
+      determining which ICU library to choose. The default is
+      <literal>DEBUG1</literal>. This setting is useful to diagnose problems
+      when a suitable ICU library is not found.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.default_icu_version</varname> (<type>string</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.default_icu_version</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Identifies the version of the ICU library to choose; or, if <varname>icu_multilib.search_by_collator_version</varname> is <literal>true</literal>, the ICU library to choose if no identical match to the collator version is found. The default is the empty string, meaning to use the built-in ICU library.
+     </para>
+     <para>
+      Valid values are either of the form <literal>major.minor</literal> such
+      as <literal>70.1</literal>; or a major version number alone, such as
+      <literal>70</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.include_builtin</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.include_builtin</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Determines whether the built-in version of ICU is included when
+      searching for a suitable ICU library. The default is
+      <literal>true</literal>.
+     </para>
+     <para>
+      Even if set to <literal>false</literal>, the built-in ICU library may be
+      used if no other suitable ICU library is identified and
+      <varname>icu_multilib.default_icu_version</varname> is set to the empty
+      string.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <varname>icu_multilib.search_by_collator_version</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>icu_multilib.search_by_collator_version</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Causes <filename>icu_multilib</filename> to identify the ICU library
+      with the highest major version that offers the required collator with an
+      exactly-matching collator version. The default is <literal>false</literal>.
+     </para>
+     <para>
+      When set to <literal>true</literal>, <filename>icu_multilib</filename>
+      relies on equal collator versions to produce identical collation order.
+     </para>
+     <para>
+      If no ICU library is found with an exactly-matching collator version,
+      <filename>icu_multilib</filename> will fall back to the
+      <varname>icu_multilib.default_icu_version</varname>; or if not set, to
+      the built-in ICU library.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Functions</title>
+
+  <table id="icu-multilib-functions">
+   <title><filename>icu_multilib</filename> Functions</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       Function
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <function>icu_multilib.library_versions</function> ()
+       <returnvalue>setof record</returnvalue>
+       ( <parameter>icu_version</parameter> <type>text</type>,
+       <parameter>unicode_version</parameter> <type>text</type>,
+       <parameter>cldr_version</parameter> <type>text</type>,
+       <parameter>libicui18n_name</parameter> <type>text</type>,
+       <parameter>libicuuc_name</parameter> <type>text</type>)
+      </para>
+      <para>
+       Returns details for each available ICU library found in
+       <varname>icu_multilib.library_path</varname>. Also includes the
+       built-in ICU library if <varname>icu_multilib.include_builtin</varname>
+       is <literal>true</literal>.
+      </para></entry>
+     </row>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <function>icu_multilib.library_collators</function> ( <parameter>major_version</parameter> <type>text</type> <literal>DEFAULT</literal> <literal>null</literal> )
+       <returnvalue>setof record</returnvalue>
+       ( <parameter>locale</parameter> <type>text</type>,
+       <parameter>icu_version</parameter> <type>text</type>,
+       <parameter>uca_version</parameter> <type>text</type>,
+       <parameter>collator_version</parameter> <type>text</type> )
+      </para>
+      <para>
+       Returns details for all available collators provided by the ICU library
+       with the given <parameter>major_version</parameter>; or by the built-in
+       ICU library if <parameter>major_version</parameter> is
+       <literal>null</literal>.
+      </para></entry>
+     </row>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <function>icu_multilib.collator_version_search</function> ( <parameter>locale</parameter> <type>text</type>, <parameter>requested_version</parameter> <type>text</type> <literal>DEFAULT</literal> <literal>null</literal>, <parameter>log_ok</parameter> <type>boolean</type> <literal>DEFAULT</literal> <literal>false</literal> )
+       <returnvalue>record</returnvalue>
+       ( <parameter>icu_version</parameter> <type>text</type>,
+       <parameter>uca_version</parameter> <type>text</type>,
+       <parameter>collator_version</parameter> <type>text</type> )
+      </para>
+      <para>
+       Performs a search for the appropriate ICU library given the locale name
+       and requested collator version, and returns details about the ICU
+       library.
+      </para>
+      <para>
+       If <parameter>requested_version</parameter> is <literal>null</literal>,
+       it will return the ICU library identified by
+       <varname>icu_multilib.default_icu_version</varname> if that is set;
+       otherwise the built-in ICU library.
+      </para>
+      <para>
+       If <parameter>log_ok</parameter> is <literal>true</literal>, the search
+       may emit log messages at the level
+       <varname>icu_multilib.library_search_log_level</varname>, which may be
+       useful for diagnosing misconfiguration.
+      </para></entry>
+     </row>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <function>icu_multilib.collator_versions</function> ( <parameter>locale</parameter> <type>text</type> )
+       <returnvalue>setof record</returnvalue>
+       ( <parameter>icu_version</parameter> <type>text</type>,
+       <parameter>uca_version</parameter> <type>text</type>,
+       <parameter>collator_version</parameter> <type>text</type> )
+      </para>
+      <para>
+       Returns details about available collators for the given
+       <parameter>locale</parameter> from all available ICU libraries.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+ </sect2>
+
+ <sect2 id="icu-multilib-examples">
+  <title>Examples</title>
+
+<screen>
+SELECT icu_version, unicode_version, cldr_version FROM icu_multilib.library_versions() LIMIT 3;
+ icu_version | unicode_version | cldr_version
+-------------+-----------------+--------------
+ 70.1        | 14.0            | 40.0
+ 69.1        | 13.0            | 39.0
+ 68.2        | 13.0            | 38.1
+(3 rows)
+</screen>
+
+<screen>
+SELECT * FROM icu_multilib.collator_versions('en_US') WHERE icu_version BETWEEN '60.0' and '65.0';
+ icu_version | uca_version | collator_version
+-------------+-------------+------------------
+ 64.2        | 12.1        | 153.97
+ 63.2        | 11.0        | 153.88
+ 62.2        | 11.0        | 153.88
+ 61.2        | 10.0        | 153.80
+ 60.3        | 10.0        | 153.80
+(5 rows)
+</screen>
+
+<screen>
+SELECT * FROM icu_multilib.collator_version_search('en_US', '153.97');
+ icu_version | uca_version | collator_version
+-------------+-------------+------------------
+ 65.1        | 12.1        | 153.97
+(1 row)
+</screen>
+
+ </sect2>
+
+ <sect2 id="icu-multilib-author">
+  <title>Authors</title>
+
+  <para>
+   Jeff Davis <email>jdavis@postgresql.org</email>, Thomas Munro <email>tmunro@postgresql.org</email>
+  </para>
+ </sect2>
+
+</sect1>
-- 
2.34.1
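
To illustrate the search-by-collator-version rule described in the icu_multilib documentation above, here is a minimal, self-contained C sketch. It is not code from the patch: the IcuLib struct and the search_by_collator_version() function are hypothetical names made up for this example, and the real module presumably works with loaded library handles rather than a plain table. The sketch only shows the selection rule as documented: choose the library with the highest major version whose collator version matches exactly, and otherwise fall back to a default.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one ICU library (illustration only). */
typedef struct IcuLib
{
	int			major;				/* ICU major version, e.g. 63 */
	int			minor;				/* ICU minor version, e.g. 2 */
	const char *collator_version;	/* collator version for one fixed locale */
} IcuLib;

/*
 * Return the entry with the highest major version whose collator version
 * equals "wanted".  If no entry matches, return "fallback", which stands in
 * for the default or built-in library and may be NULL.
 */
static const IcuLib *
search_by_collator_version(const IcuLib *libs, size_t nlibs,
						   const char *wanted, const IcuLib *fallback)
{
	const IcuLib *best = NULL;

	for (size_t i = 0; i < nlibs; i++)
	{
		/* Only an exact collator version match qualifies. */
		if (strcmp(libs[i].collator_version, wanted) != 0)
			continue;

		/* Among matches, prefer the highest ICU major version. */
		if (best == NULL || libs[i].major > best->major)
			best = &libs[i];
	}

	return best != NULL ? best : fallback;
}
```

With the versions quoted in the documentation (62.2 and 63.2 both providing collator version 153.88, and 64.2 providing 153.97), a search for 153.88 selects the 63.2 entry, and a search for a version that no library provides returns the fallback.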

v8-0003-Add-initdb-option-no-import-collations.patch (text/x-patch; charset=UTF-8)

From 966cdc29309965c3dc5a5608c38c36faeb9cbb77 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 18 Jan 2023 12:32:56 -0800
Subject: [PATCH v8 3/4] Add initdb option --no-import-collations.

In preparation for the icu_multilib contrib module, which allows
greater control over how ICU collations and different ICU library
versions are managed and used.

By not creating the collations during initdb time, the user will be
able to run pg_import_system_collations() themselves using a specific
version of ICU.
---
 doc/src/sgml/ref/initdb.sgml | 11 +++++++++++
 src/bin/initdb/initdb.c      | 10 +++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 5b2bdac101..5a2da829fb 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -305,6 +305,17 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-import-collations">
+      <term><option>--no-import-collations</option></term>
+      <listitem>
+       <para>
+        Prevents <command>initdb</command> from importing collations from the
+        system. This can be used to more carefully control which collations
+        are initially loaded in each database.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-sync">
       <term><option>-N</option></term>
       <term><option>--no-sync</option></term>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7a58c33ace..acaeb00086 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -135,6 +135,7 @@ static char *lc_time = NULL;
 static char *lc_messages = NULL;
 static char locale_provider = COLLPROVIDER_LIBC;
 static char *icu_locale = NULL;
+static bool import_system_collations = true;
 static const char *default_text_search_config = NULL;
 static char *username = NULL;
 static bool pwprompt = false;
@@ -1491,7 +1492,8 @@ setup_collation(FILE *cmdfd)
 				  BOOTSTRAP_SUPERUSERID, COLLPROVIDER_LIBC, PG_UTF8);
 
 	/* Now import all collations we can find in the operating system */
-	PG_CMD_PUTS("SELECT pg_import_system_collations('pg_catalog');\n\n");
+	if (import_system_collations)
+		PG_CMD_PUTS("SELECT pg_import_system_collations('pg_catalog');\n\n");
 }
 
 /*
@@ -2116,6 +2118,8 @@ usage(const char *progname)
 	printf(_("      --no-locale           equivalent to --locale=C\n"));
 	printf(_("      --locale-provider={libc|icu}\n"
 			 "                            set default locale provider for new databases\n"));
+	printf(_("      --no-import-collations\n"
+			 "                            do not import collations from the system\n"));
 	printf(_("      --pwfile=FILE         read password for the new superuser from file\n"));
 	printf(_("  -T, --text-search-config=CFG\n"
 			 "                            default text search configuration\n"));
@@ -2767,6 +2771,7 @@ main(int argc, char *argv[])
 		{"discard-caches", no_argument, NULL, 14},
 		{"locale-provider", required_argument, NULL, 15},
 		{"icu-locale", required_argument, NULL, 16},
+		{"no-import-collations", no_argument, NULL, 17},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -2924,6 +2929,9 @@ main(int argc, char *argv[])
 			case 16:
 				icu_locale = pg_strdup(optarg);
 				break;
+			case 17:
+				import_system_collations = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
-- 
2.34.1
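
The option handling in the patch above follows the usual getopt_long() pattern for a flag that has only a long form: the option table maps the name to a numeric code (17 in the patch), and the switch clears a boolean when that code is returned. The standalone sketch below shows that pattern in isolation. It is an illustration rather than initdb code: parse_import_flag() is a hypothetical helper, and the sketch assumes a C library that provides getopt_long() in <getopt.h>.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if system collations should be imported, or false if
 * --no-import-collations appears on the command line.
 */
static bool
parse_import_flag(int argc, char *argv[])
{
	static const struct option long_options[] = {
		/* Long-only option: no short letter, so use a numeric code. */
		{"no-import-collations", no_argument, NULL, 17},
		{NULL, 0, NULL, 0}
	};
	bool		import_system_collations = true;
	int			c;

	optind = 1;					/* start scanning at the first argument */

	while ((c = getopt_long(argc, argv, "", long_options, NULL)) != -1)
	{
		switch (c)
		{
			case 17:
				import_system_collations = false;
				break;
			default:
				/* getopt_long has already reported the problem */
				break;
		}
	}

	return import_system_collations;
}
```

The default stays true, so existing invocations keep importing collations unless the new flag is given.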

v8-0002-Add-test-module-for-icu-collation-provider-hook.patch (text/x-patch; charset=UTF-8)

From 540559f8104943eaea9a657ecf16326b31e7e545 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 21 Dec 2022 19:50:10 -0800
Subject: [PATCH v8 2/4] Add test module for icu collation provider hook.

---
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 .../modules/test_collator_lib_hooks/Makefile  |  24 ++
 .../test_collator_lib_hooks/meson.build       |  37 +++
 .../test_collator_lib_hooks/t/001_icu.pl      | 153 ++++++++++++
 .../test_collator_lib_hooks.c                 |  43 ++++
 .../test_collator_lib_hooks.control           |   4 +
 .../test_collator_lib_hooks.h                 |  28 +++
 .../test_collator_lib_hooks/test_icu_hook.c   | 228 ++++++++++++++++++
 9 files changed, 519 insertions(+)
 create mode 100644 src/test/modules/test_collator_lib_hooks/Makefile
 create mode 100644 src/test/modules/test_collator_lib_hooks/meson.build
 create mode 100644 src/test/modules/test_collator_lib_hooks/t/001_icu.pl
 create mode 100644 src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.c
 create mode 100644 src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.control
 create mode 100644 src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.h
 create mode 100644 src/test/modules/test_collator_lib_hooks/test_icu_hook.c

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..8a90b206a1 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -15,6 +15,7 @@ SUBDIRS = \
 		  snapshot_too_old \
 		  spgist_name_ops \
 		  test_bloomfilter \
+		  test_collator_lib_hooks \
 		  test_copy_callbacks \
 		  test_custom_rmgrs \
 		  test_ddl_deparse \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..a59cc9a89f 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -11,6 +11,7 @@ subdir('snapshot_too_old')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
+subdir('test_collator_lib_hooks')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
 subdir('test_ddl_deparse')
diff --git a/src/test/modules/test_collator_lib_hooks/Makefile b/src/test/modules/test_collator_lib_hooks/Makefile
new file mode 100644
index 0000000000..b44c8f5914
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/Makefile
@@ -0,0 +1,24 @@
+# src/test/modules/test_collator_lib_hooks/Makefile
+
+MODULE_big = test_collator_lib_hooks
+OBJS = \
+	$(WIN32RES) \
+	test_collator_lib_hooks.o test_icu_hook.o
+PGFILEDESC = "test_collator_lib_hooks - test collator provider library hooks"
+
+EXTENSION = test_collator_lib_hooks
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_collator_lib_hooks
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+export with_icu
diff --git a/src/test/modules/test_collator_lib_hooks/meson.build b/src/test/modules/test_collator_lib_hooks/meson.build
new file mode 100644
index 0000000000..0190a856e0
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/meson.build
@@ -0,0 +1,37 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_collator_lib_hooks_sources = files(
+  'test_collator_lib_hooks.c',
+  'test_icu_hook.c',
+)
+
+if host_system == 'windows'
+  test_collator_lib_hooks_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_collator_lib_hooks',
+    '--FILEDESC', 'test_collator_lib_hooks - test collation provider library hooks',])
+endif
+
+test_collator_lib_hooks = shared_module('test_collator_lib_hooks',
+  test_collator_lib_hooks_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_collator_lib_hooks
+
+install_data(
+  'test_collator_lib_hooks.control',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_collator_lib_hooks',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_icu.pl',
+    ],
+    'env': {
+      'with_icu': icu.found() ? 'yes' : 'no',
+    },
+  },
+}
diff --git a/src/test/modules/test_collator_lib_hooks/t/001_icu.pl b/src/test/modules/test_collator_lib_hooks/t/001_icu.pl
new file mode 100644
index 0000000000..d3b5729527
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/t/001_icu.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{with_icu} ne 'yes')
+{
+	plan skip_all => 'ICU not supported by this build';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collator_lib_hooks'
+});
+$node->start;
+
+$node->safe_psql('postgres',
+	q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' ICU_LOCALE 'DESC' ENCODING 'UTF8' TEMPLATE template0}
+);
+
+# setup
+$node->safe_psql('dbicu',
+	qq[CREATE COLLATION test_asc (PROVIDER=icu, LOCALE='ASC')]);
+$node->safe_psql('dbicu',
+	qq[CREATE COLLATION test_desc (PROVIDER=icu, LOCALE='DESC')]);
+
+$node->safe_psql('dbicu', qq[CREATE TABLE strings(t text)]);
+$node->safe_psql('dbicu',
+	qq[INSERT INTO strings VALUES ('aBcD'), ('fGhI'), ('wXyZ')]);
+
+# check versions
+
+my $version_db =
+  $node->safe_psql('dbicu',
+	  qq[SELECT datcollversion FROM pg_database WHERE datname='dbicu']);
+is($version_db, '2.72',
+	'database "dbicu" has correct version 2.72'
+);
+
+my $version_asc =
+  $node->safe_psql('dbicu',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_asc']);
+is($version_asc, '2.72',
+	'collation "test_asc" has correct version 2.72'
+);
+
+my $version_desc =
+  $node->safe_psql('dbicu',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_desc']);
+is($version_desc, '2.72',
+	'collation "test_desc" has correct version 2.72'
+);
+
+my $res_sort_expected = "aBcD
+fGhI
+wXyZ";
+
+my $res_reversesort_expected = "wXyZ
+fGhI
+aBcD";
+
+# test comparison
+
+my $comparison =
+  $node->safe_psql('dbicu',
+	  qq[SELECT 'aBcD' COLLATE test_asc < 'wXyZ' COLLATE test_asc]);
+is($comparison, 't',
+	'correct comparison'
+);
+
+# test reverse comparison (database)
+
+my $dbcomparison_reverse =
+  $node->safe_psql('dbicu', qq[SELECT 'aBcD' < 'wXyZ']);
+is($dbcomparison_reverse, 'f',
+	'correct reverse comparison (database)'
+);
+
+# test reverse comparison
+
+my $comparison_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT 'aBcD' COLLATE test_desc < 'wXyZ' COLLATE test_desc]);
+is($comparison_reverse, 'f',
+	'correct reverse comparison'
+);
+
+# test asc sort
+
+my $res_sort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t COLLATE test_asc]);
+is($res_sort, $res_sort_expected,
+	'correct ascending sort'
+);
+
+# test desc sort (database)
+
+my $res_db_reversesort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t]);
+is($res_db_reversesort, $res_reversesort_expected,
+	'correct descending sort (database)'
+);
+
+# test desc sort
+
+my $res_reversesort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t COLLATE test_desc]);
+is($res_reversesort, $res_reversesort_expected,
+	'correct descending sort'
+);
+
+# test lower/upper
+
+my $tcase =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_asc),
+                upper('aBcDfgHiwXyZ' collate test_asc)]);
+is($tcase, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+	'correct lowercase and uppercase'
+);
+
+# test reverse lower/upper (database)
+
+my $tcase_db_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ'),
+                upper('aBcDfgHiwXyZ')]);
+is($tcase_db_reverse, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct reverse lowercase and uppercase (database)'
+);
+
+# test reverse lower/upper
+
+my $tcase_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_desc),
+                upper('aBcDfgHiwXyZ' collate test_desc)]);
+is($tcase_reverse, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct reverse lowercase and uppercase'
+);
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.c b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.c
new file mode 100644
index 0000000000..45a849b52d
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.c
@@ -0,0 +1,43 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_collator_lib_hooks.c
+ *		Code for testing collation provider library hooks
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.c
+ *
+ * Test implementation of icu-like collation provider.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "miscadmin.h"
+
+#include "test_collator_lib_hooks.h"
+
+#ifdef USE_ICU
+static get_icu_library_hook_type prev_get_icu_library_hook = NULL;
+#endif
+
+PG_MODULE_MAGIC;
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	if (!process_shared_preload_libraries_in_progress)
+		ereport(ERROR, (errmsg("test_collator_lib_hooks must be loaded via shared_preload_libraries")));
+
+#ifdef USE_ICU
+	prev_get_icu_library_hook = get_icu_library_hook;
+	get_icu_library_hook = test_get_icu_library;
+#endif
+}
diff --git a/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.control b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.control
new file mode 100644
index 0000000000..a0b8e031a4
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.control
@@ -0,0 +1,4 @@
+comment = 'Test code for collation provider library hooks'
+default_version = '1.0'
+module_pathname = '$libdir/test_collator_lib_hooks'
+
diff --git a/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.h b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.h
new file mode 100644
index 0000000000..c3a9ff45e2
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.h
@@ -0,0 +1,28 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_collator_lib_hooks.h
+ *		Definitions for collation library hooks.
+ *
+ * Copyright (c) 2015-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collator_lib_hooks/test_collator_lib_hooks.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#ifndef TEST_COLLATOR_LIB_HOOKS_H
+#define TEST_COLLATOR_LIB_HOOKS_H
+
+#include "postgres.h"
+
+#include "utils/memutils.h"
+#include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
+
+#ifdef USE_ICU
+extern pg_icu_library *test_get_icu_library(Oid collid, const char *locale,
+											const char *version);
+#endif
+
+#endif
diff --git a/src/test/modules/test_collator_lib_hooks/test_icu_hook.c b/src/test/modules/test_collator_lib_hooks/test_icu_hook.c
new file mode 100644
index 0000000000..423ff30b5b
--- /dev/null
+++ b/src/test/modules/test_collator_lib_hooks/test_icu_hook.c
@@ -0,0 +1,228 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_icu_hook.c
+ *		Code for testing collation provider icu hook.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collator_lib_hooks/test_icu_hook.c
+ *
+ * Implements a custom icu-like collation provider library for testing the
+ * hooks. It accepts any collator name requested. All behave exactly like the
+ * "en_US" locale, except for the locale named "DESC", which reverses the sort
+ * order and reverses uppercase/lowercase behavior.
+ *
+ * The version is always reported as 2.72, so loading it will cause a version
+ * mismatch warning.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "test_collator_lib_hooks.h"
+
+#ifdef USE_ICU
+
+#include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
+
+#define TEST_LOCALE "en_US"
+
+typedef struct TestUCollator {
+	UCollator	*ucol;
+	bool		 reverse;
+} TestUCollator;
+
+static pg_icu_library *test_icu_library = NULL;
+static const UVersionInfo test_icu_version = { 2, 72 };
+
+static bool
+locale_is_reverse(const char *locale)
+{
+	if (strcmp(locale, "DESC") == 0)
+		return true;
+	else
+		return false;
+}
+
+static UCollator *
+test_openCollator(const char *loc, UErrorCode *status)
+{
+	TestUCollator *testcol = MemoryContextAlloc(TopMemoryContext, sizeof(TestUCollator));
+	UCollator *ucol = ucol_open(TEST_LOCALE, status);
+	testcol->ucol = ucol;
+	testcol->reverse = locale_is_reverse(loc);
+	return (UCollator *)testcol;
+}
+
+static void
+test_closeCollator(UCollator *coll)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	ucol_close(testcol->ucol);
+	pfree(testcol);
+}
+
+static void
+test_setAttribute(UCollator *coll, UColAttribute attr,
+				  UColAttributeValue value, UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	ucol_setAttribute(testcol->ucol, attr, value, status);
+}
+
+static void
+test_getCollatorVersion(const UCollator *coll, UVersionInfo info)
+{
+	memcpy(info, test_icu_version, sizeof(UVersionInfo));
+}
+
+static UCollationResult
+test_strcoll(const UCollator *coll, const UChar *source, int32_t sourceLength,
+			 const UChar *target, int32_t targetLength)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	UCollationResult ret = ucol_strcoll(testcol->ucol, source, sourceLength,
+										target, targetLength);
+	if (testcol->reverse)
+		return -ret;
+	else
+		return ret;
+}
+
+static UCollationResult
+test_strcollUTF8(const UCollator *coll, const char *source,
+				 int32_t sourceLength, const char *target,
+				 int32_t targetLength, UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	UCollationResult ret = ucol_strcollUTF8(testcol->ucol, source,
+											sourceLength, target,
+											targetLength, status);
+	if (testcol->reverse)
+		return -ret;
+	else
+		return ret;
+}
+
+static int32_t
+test_getSortKey(const UCollator *coll, const UChar *source,
+				int32_t sourceLength, uint8_t *result, int32_t resultLength)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	int32_t ret = ucol_getSortKey(testcol->ucol, source, sourceLength,
+								  result, resultLength);
+	size_t result_size = ret + 1;
+
+	if (resultLength >= result_size)
+	{
+		result[ret] = '\0';		/* last byte of result_size; stays in bounds */
+
+		if (testcol->reverse)
+			for (int i = 0; i < result_size; i++)
+				*((unsigned char *) result + i) ^= (unsigned char) 0xff;
+	}
+
+	return result_size;
+}
+
+static int32_t
+test_nextSortKeyPart(const UCollator *coll, UCharIterator *iter,
+					 uint32_t state[2], uint8_t *dest, int32_t count,
+					 UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	int32_t ret = ucol_nextSortKeyPart(testcol->ucol, iter, state, dest,
+									   count, status);
+
+	if (testcol->reverse)
+		for (int i = 0; i < ret; i++)
+			*((unsigned char *) dest + i) ^= (unsigned char) 0xff;
+
+	/*
+	 * The following is not correct for cases where we finish precisely on the
+	 * boundary (i.e. count is exactly enough). To fix this we'd need to track
+	 * additional state across calls, which doesn't seem worth it for a test
+	 * case.
+	 */
+	if (count > ret && ret > 0)
+	{
+		if (testcol->reverse)
+			dest[ret] = 0xff;
+		else
+			dest[ret] = '\0';
+		return ret + 1;
+	}
+
+	return ret;
+}
+
+static int32_t
+test_strToUpper(UChar *dest, int32_t destCapacity, const UChar *src,
+				int32_t srcLength, const char *locale, UErrorCode *pErrorCode)
+{
+	if (locale_is_reverse(locale))
+		return u_strToLower(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+	else
+		return u_strToUpper(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+}
+
+static int32_t
+test_strToLower(UChar *dest, int32_t destCapacity, const UChar *src,
+				int32_t srcLength, const char *locale, UErrorCode *pErrorCode)
+{
+	if (locale_is_reverse(locale))
+		return u_strToUpper(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+	else
+		return u_strToLower(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+}
+
+pg_icu_library *
+test_get_icu_library(Oid collid, const char *locale, const char *version)
+{
+	pg_icu_library *lib;
+
+	if (test_icu_library != NULL)
+		return test_icu_library;
+
+	ereport(LOG, (errmsg("loading custom ICU provider for test_collator_lib_hooks")));
+
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(pg_icu_library));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = test_openCollator;
+	lib->closeCollator = test_closeCollator;
+	lib->getCollatorVersion = test_getCollatorVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = test_strcoll;
+	lib->strcollUTF8 = test_strcollUTF8;
+	lib->getSortKey = test_getSortKey;
+	lib->nextSortKeyPart = test_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = test_strToUpper;
+	lib->strToLower = test_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = test_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	test_icu_library = lib;
+	return lib;
+}
+
+#endif				/* USE_ICU */
-- 
2.34.1
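
The v8-0001 patch that follows routes every ICU call through a table of function pointers and lets a hook pick which table to use, falling back to the built-in one when the hook is unset or returns NULL. A minimal standalone sketch of that pattern, for illustration only: all names here (demo_library, demo_library_hook, demo_hook, demo_get_library) are hypothetical stand-ins, plain strcmp takes the place of ICU, and none of this is code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for pg_icu_library: one pointer per provider call. */
typedef struct demo_library
{
	const char *name;
	int			(*compare) (const char *a, const char *b);
} demo_library;

/* Hook type: return a library for this locale, or NULL to decline. */
typedef demo_library *(*demo_library_hook_type) (const char *locale);

/* Unset by default, like get_icu_library_hook in the patch. */
demo_library_hook_type demo_library_hook = NULL;

static int
builtin_compare(const char *a, const char *b)
{
	return strcmp(a, b);
}

static int
reverse_compare(const char *a, const char *b)
{
	return strcmp(b, a);
}

static demo_library builtin_library = {"builtin", builtin_compare};
static demo_library reverse_library = {"reverse", reverse_compare};

/* Example hook: claims only the locale named "DESC", declines the rest. */
demo_library *
demo_hook(const char *locale)
{
	if (strcmp(locale, "DESC") == 0)
		return &reverse_library;
	return NULL;
}

/* Ask the hook first; fall back to the built-in table if it declines. */
demo_library *
demo_get_library(const char *locale)
{
	demo_library *lib = NULL;

	assert(locale != NULL);
	if (demo_library_hook != NULL)
		lib = demo_library_hook(locale);
	if (lib == NULL)
		lib = &builtin_library;
	return lib;
}
```

Callers never name a comparison function directly; they call demo_get_library(locale)->compare(a, b), so installing demo_hook changes the behavior for "DESC" without touching any call site.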

v8-0001-Support-multiple-ICU-collation-provider-libraries.patch (text/x-patch; charset=UTF-8)

From 76c3d824ad777f259ea37c5eb2f6454999d8fb48 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 7 Dec 2022 11:07:31 -0800
Subject: [PATCH v8 1/4] Support multiple ICU collation provider libraries.

Introduce structure pg_icu_library, which holds pointers to each
required ICU method, and store this as part of pg_locale_t. Each call
to an ICU function instead goes through this structure, so that it can
more easily be replaced by a non-builtin ICU library.

Offer a hook to allow an extension to control which ICU library is
loaded for a given locale and version.

Author: Thomas Munro, Jeff Davis
---
 src/backend/commands/collationcmds.c   |  17 +-
 src/backend/utils/adt/formatting.c     |  67 ++++--
 src/backend/utils/adt/pg_locale.c      | 316 ++++++++++++++++++-------
 src/include/utils/pg_locale_internal.h | 124 +++++++++-
 4 files changed, 414 insertions(+), 110 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index d878be2fb8..217c13f066 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -566,13 +566,14 @@ get_icu_language_tag(const char *localename)
 {
 	char		buf[ULOC_FULLNAME_CAPACITY];
 	UErrorCode	status;
+	pg_icu_library *iculib = get_default_icu_library();
 
 	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
+	iculib->toLanguageTag(localename, buf, sizeof(buf), true, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
+						localename, iculib->errorName(status))));
 
 	return pstrdup(buf);
 }
@@ -591,11 +592,12 @@ get_icu_locale_comment(const char *localename)
 	int32		len_uchar;
 	int32		i;
 	char	   *result;
+	pg_icu_library *iculib = get_default_icu_library();
 
 	status = U_ZERO_ERROR;
-	len_uchar = uloc_getDisplayName(localename, "en",
-									displayname, lengthof(displayname),
-									&status);
+	len_uchar = iculib->getDisplayName(localename, "en",
+									   displayname, lengthof(displayname),
+									   &status);
 	if (U_FAILURE(status))
 		return NULL;			/* no good reason to raise an error */
 
@@ -921,12 +923,13 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 	{
 		int			i;
+		pg_icu_library *iculib = get_default_icu_library();
 
 		/*
 		 * Start the loop at -1 to sneak in the root locale without too much
 		 * code duplication.
 		 */
-		for (i = -1; i < uloc_countAvailable(); i++)
+		for (i = -1; i < iculib->countAvailable(); i++)
 		{
 			const char *name;
 			char	   *langtag;
@@ -937,7 +940,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			if (i == -1)
 				name = "";		/* ICU root locale */
 			else
-				name = uloc_getAvailable(i);
+				name = iculib->getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
 			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index f45e2043a1..0a683cd816 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1566,6 +1566,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1573,6 +1578,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 {
 	UErrorCode	status;
 	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
 	len_dest = len_source;		/* try first with same length */
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
@@ -1590,18 +1596,42 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->ctype, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->ctype, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1667,11 +1697,12 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			int32_t		len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToLower, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1789,11 +1820,12 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToUpper, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1912,11 +1944,12 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case_bi(iculib->strToTitle, mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index d3ca3c5011..10b179cb48 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -70,6 +70,8 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -110,6 +112,41 @@ char	   *localized_full_days[7 + 1];
 char	   *localized_abbrev_months[12 + 1];
 char	   *localized_full_months[12 + 1];
 
+/*
+ * The precise version of a collation provider library is important, because
+ * subtle changes in collation between library versions can corrupt
+ * indexes. This hook allows control over how collation provider libraries are
+ * loaded.
+ *
+ * If the hook is not set, or if it returns NULL, Postgres constructs the
+ * pg_icu_library structure from the symbols Postgres is linked with at build
+ * time.
+ *
+ * If the hook is set, it can instead construct the pg_icu_library structure
+ * using custom logic. Ordinarily, this custom logic would involve finding a
+ * specific known version of the collation provider library, and dynamically
+ * loading the necessary symbols. If the collation version obtained from the
+ * library does not match the collation version recorded in the catalog,
+ * Postgres will issue a WARNING.
+ *
+ * The returned structure must be allocated in TopMemoryContext, and the
+ * associated symbols must remain valid permanently. It's expected that the
+ * hook will maintain a small finite list of libraries and return a pointer to
+ * the appropriate one, so that resource use remains bounded.
+ *
+ * XXX: For now, the only information the hook has access to is the ICU locale
+ * name, and the collation version as recorded in the catalog. We should
+ * consider what other information can be provided to allow for greater
+ * control over which library is loaded.
+ */
+#ifdef USE_ICU
+get_icu_library_hook_type get_icu_library_hook = NULL;
+#endif
+
+#ifdef USE_ICU
+static pg_icu_library *builtin_icu_library = NULL;
+#endif
+
 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
 static bool CurrentLCTimeValid = false;
@@ -136,8 +173,12 @@ static char *IsoLocaleName(const char *);
  * Database default locale.
  */
 static pg_locale_t default_locale = NULL;
+#ifdef USE_ICU
+static pg_icu_library *default_icu = NULL;
+#endif
 
 #ifdef USE_ICU
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -145,13 +186,17 @@ static pg_locale_t default_locale = NULL;
  */
 static UConverter *icu_converter = NULL;
 
-static void init_icu_converter(void);
-static size_t uchar_length(UConverter *converter,
+static void init_icu_converter(pg_icu_library *iculib);
+static size_t uchar_length(pg_icu_library *iculib,
+						   UConverter *converter,
 						   const char *str, size_t len);
-static int32_t uchar_convert(UConverter *converter,
+static int32_t uchar_convert(pg_icu_library *iculib,
+							 UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, size_t srclen);
-static void icu_set_collation_attributes(UCollator *collator, const char *loc);
+static void icu_set_collation_attributes(pg_icu_library *iculib,
+										 UCollator *collator,
+										 const char *loc);
 #endif
 
 /*
@@ -1458,6 +1503,85 @@ report_newlocale_failure(const char *localename)
 }
 #endif							/* HAVE_LOCALE_T */
 
+#ifdef USE_ICU
+pg_icu_library *
+get_builtin_icu_library(void)
+{
+	UVersionInfo version_info;
+	pg_icu_library *lib;
+
+	if (builtin_icu_library != NULL)
+		return builtin_icu_library;
+
+	/*
+	 * These assignments will fail to compile if an incompatible API change is
+	 * made to some future version of ICU, at which point we might need to
+	 * consider special treatment for different major version ranges, with
+	 * intermediate trampoline functions.
+	 */
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(pg_icu_library));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = ucol_open;
+	lib->closeCollator = ucol_close;
+	lib->getCollatorVersion = ucol_getVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = ucol_strcoll;
+	lib->strcollUTF8 = ucol_strcollUTF8;
+	lib->getSortKey = ucol_getSortKey;
+	lib->nextSortKeyPart = ucol_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = u_strToUpper;
+	lib->strToLower = u_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = ucol_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	lib->getICUVersion(version_info);
+	lib->major_version = version_info[0];
+	lib->minor_version = version_info[1];
+	lib->libicui18n_name = "<builtin>";
+	lib->libicuuc_name = "<builtin>";
+
+	/*
+	 * Also assert the size of a couple of types used as output buffers, as a
+	 * canary to tell us to add extra padding in the (unlikely) event that a
+	 * later release makes these values smaller.
+	 */
+	StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+					 "u_versionToString output buffer size changed incompatibly");
+	StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+					 "ucol_getVersion output buffer size changed incompatibly");
+
+	builtin_icu_library = lib;
+	return lib;
+}
+
+static pg_icu_library *
+get_icu_library(Oid collid, const char *locale, const char *version)
+{
+	pg_icu_library *lib = NULL;
+
+	if (get_icu_library_hook != NULL)
+		lib = get_icu_library_hook(collid, locale, version);
+
+	if (!lib)
+		lib = get_builtin_icu_library();
+
+	return lib;
+}
+
+#endif
 
 /*
  * Construct a new pg_locale_t object.
@@ -1477,8 +1601,8 @@ report_newlocale_failure(const char *localename)
  * effectively permanent.
  */
 static pg_locale_t
-pg_newlocale(char provider, bool deterministic, const char *collate,
-			 const char *ctype, const char *version)
+pg_newlocale(Oid collid, char provider, bool deterministic,
+			 const char *collate, const char *ctype, const char *version)
 {
 	pg_locale_t result = MemoryContextAlloc(TopMemoryContext,
 											sizeof(struct pg_locale_struct));
@@ -1554,20 +1678,24 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 	{
 		UCollator  *collator;
 		UErrorCode	status;
+		pg_icu_library *iculib = get_icu_library(collid, collate, version);
+
+		Assert(strcmp(collate, ctype) == 0);
 
 		/* collator may be leaked if we encounter an error */
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collate, &status);
+		collator = iculib->openCollator(collate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collate, u_errorName(status))));
+							collate, iculib->errorName(status))));
 
 		if (U_ICU_VERSION_MAJOR_NUM < 54)
-			icu_set_collation_attributes(collator, collate);
+			icu_set_collation_attributes(iculib, collator, collate);
 
 		result->info.icu.ucol = collator;
+		result->info.icu.lib = iculib;
 	}
 #endif
 	else
@@ -1589,6 +1717,14 @@ pg_locale_deterministic(pg_locale_t locale)
 		return locale->deterministic;
 }
 
+#ifdef USE_ICU
+pg_icu_library *
+get_default_icu_library(void)
+{
+	return default_icu;
+}
+#endif
+
 /*
  * Initialize default database locale.
  */
@@ -1596,6 +1732,10 @@ void
 init_default_locale(char provider, const char *collate, const char *ctype,
 					const char *iculocale, const char *version)
 {
+#ifdef USE_ICU
+	default_icu = get_icu_library(DEFAULT_COLLATION_OID, iculocale, version);
+#endif
+
 	/*
 	 * For the purposes of pg_locale_t, if the provider is ICU, we use
 	 * iculocale for both collate and ctype.
@@ -1613,7 +1753,8 @@ init_default_locale(char provider, const char *collate, const char *ctype,
 	 * locales currently don't support pattern matching, which would break a
 	 * lot of things if applied globally.
 	 */
-	default_locale = pg_newlocale(provider, true, collate, ctype, version);
+	default_locale = pg_newlocale(DEFAULT_COLLATION_OID, provider, true,
+								  collate, ctype, version);
 }
 
 /*
@@ -1699,7 +1840,7 @@ pg_newlocale_from_collation(Oid collid)
 			/* shouldn't happen */
 			elog(ERROR, "unsupported collprovider: %c", collform->collprovider);
 
-		locale = pg_newlocale(collform->collprovider,
+		locale = pg_newlocale(collid, collform->collprovider,
 							  collform->collisdeterministic,
 							  collate, ctype, collversionstr);
 
@@ -1759,17 +1900,19 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
+		pg_icu_library	*iculib = get_icu_library(InvalidOid, collcollate,
+												  NULL);
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = iculib->openCollator(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, iculib->errorName(status))));
+		iculib->getCollatorVersion(collator, versioninfo);
+		iculib->closeCollator(collator);
 
-		u_versionToString(versioninfo, buf);
+		iculib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -2021,16 +2164,17 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	UChar	*uchar1,
 			*uchar2;
 	int		 result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 #ifdef HAVE_UCOL_STRCOLLUTF8
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 #endif
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen1 = uchar_length(icu_converter, arg1, len1);
-	ulen2 = uchar_length(icu_converter, arg2, len2);
+	ulen1 = uchar_length(iculib, icu_converter, arg1, len1);
+	ulen2 = uchar_length(iculib, icu_converter, arg2, len2);
 
 	bufsize1 = (ulen1 + 1) * sizeof(UChar);
 	bufsize2 = (ulen2 + 1) * sizeof(UChar);
@@ -2041,12 +2185,12 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	uchar1 = (UChar *) buf;
 	uchar2 = (UChar *) (buf + bufsize1);
 
-	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
-	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+	ulen1 = uchar_convert(iculib, icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(iculib, icu_converter, uchar2, ulen2 + 1, arg2, len2);
 
-	result = ucol_strcoll(locale->info.icu.ucol,
-						  uchar1, ulen1,
-						  uchar2, ulen2);
+	result = iculib->strcoll(locale->info.icu.ucol,
+							 uchar1, ulen1,
+							 uchar2, ulen2);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2067,6 +2211,7 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 				pg_locale_t locale)
 {
 	int result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
@@ -2076,13 +2221,14 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 		UErrorCode	status;
 
 		status = U_ZERO_ERROR;
-		result = ucol_strcollUTF8(locale->info.icu.ucol,
-								  arg1, len1,
-								  arg2, len2,
-								  &status);
+		result = iculib->strcollUTF8(locale->info.icu.ucol,
+									 arg1, len1,
+									 arg2, len2,
+									 &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
-					(errmsg("collation failed: %s", u_errorName(status))));
+					(errmsg("collation failed: %s",
+							iculib->errorName(status))));
 	}
 	else
 #endif
@@ -2242,12 +2388,13 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 	int32_t	 ulen;
 	size_t   uchar_bsize;
 	Size	 result_bsize;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2256,11 +2403,11 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
-	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
-								   uchar, ulen,
-								   (uint8_t *) dest, destsize);
+	result_bsize = iculib->getSortKey(locale->info.icu.ucol,
+									  uchar, ulen,
+									  (uint8_t *) dest, destsize);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2289,13 +2436,14 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 	UChar			*uchar = NULL;
 	size_t			 uchar_bsize;
 	Size			 result_bsize;
+	pg_icu_library	*iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2304,21 +2452,19 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
 	uiter_setString(&iter, uchar, ulen);
 	state[0] = state[1] = 0;	/* won't need that again */
 	status = U_ZERO_ERROR;
-	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
-										&iter,
-										state,
-										(uint8_t *) dest,
-										destsize,
-										&status);
+	result_bsize = iculib->nextSortKeyPart(
+		locale->info.icu.ucol, &iter, state,
+		(uint8_t *) dest, destsize, &status);
+
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("sort key generation failed: %s",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return result_bsize;
 }
@@ -2327,6 +2473,7 @@ static size_t
 pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 					   size_t destsize, pg_locale_t locale)
 {
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 	size_t result;
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
@@ -2337,19 +2484,17 @@ pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 		uint32_t	state[2];
 		UErrorCode	status;
 
-		uiter_setUTF8(&iter, src, srclen);
+		iculib->setUTF8(&iter, src, srclen);
 		state[0] = state[1] = 0;	/* won't need that again */
 		status = U_ZERO_ERROR;
-		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
-									  &iter,
-									  state,
-									  (uint8_t *) dest,
-									  destsize,
-									  &status);
+		result = iculib->nextSortKeyPart(
+			locale->info.icu.ucol, &iter, state,
+			(uint8_t *) dest, destsize, &status);
+
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("sort key generation failed: %s",
-							u_errorName(status))));
+							iculib->errorName(status))));
 	}
 	else
 		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
@@ -2546,7 +2691,7 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 
 #ifdef USE_ICU
 static void
-init_icu_converter(void)
+init_icu_converter(pg_icu_library *iculib)
 {
 	const char *icu_encoding_name;
 	UErrorCode	status;
@@ -2563,11 +2708,11 @@ init_icu_converter(void)
 						pg_encoding_to_char(GetDatabaseEncoding()))));
 
 	status = U_ZERO_ERROR;
-	conv = ucnv_open(icu_encoding_name, &status);
+	conv = iculib->openConverter(icu_encoding_name, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open ICU converter for encoding \"%s\": %s",
-						icu_encoding_name, u_errorName(status))));
+						icu_encoding_name, iculib->errorName(status))));
 
 	icu_converter = conv;
 }
@@ -2576,14 +2721,15 @@ init_icu_converter(void)
  * Find length, in UChars, of given string if converted to UChar string.
  */
 static size_t
-uchar_length(UConverter *converter, const char *str, size_t len)
+uchar_length(pg_icu_library *iculib, UConverter *converter, const char *str, size_t len)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
-	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	ulen = iculib->toUChars(converter, NULL, 0, str, len, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2592,16 +2738,17 @@ uchar_length(UConverter *converter, const char *str, size_t len)
  * return the length (in UChars).
  */
 static int32_t
-uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
-			  const char *src, size_t srclen)
+uchar_convert(pg_icu_library *iculib, UConverter *converter, UChar *dest,
+			  int32_t destlen, const char *src, size_t srclen)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
 	status = U_ZERO_ERROR;
-	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	ulen = iculib->toUChars(converter, dest, destlen, src, srclen, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2618,16 +2765,17 @@ uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
  * result length instead.
  */
 int32_t
-icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
+icu_to_uchar(pg_icu_library *iculib, UChar **buff_uchar, const char *buff,
+			 size_t nbytes)
 {
 	int32_t len_uchar;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	len_uchar = uchar_length(icu_converter, buff, nbytes);
+	len_uchar = uchar_length(iculib, icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-	len_uchar = uchar_convert(icu_converter,
+	len_uchar = uchar_convert(iculib, icu_converter,
 							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
@@ -2645,30 +2793,32 @@ icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
  * The result string is nul-terminated.
  */
 int32_t
-icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
+icu_from_uchar(pg_icu_library *iculib, char **result, const UChar *buff_uchar,
+			   int32_t len_uchar)
 {
 	UErrorCode	status;
 	int32_t		len_result;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, NULL, 0,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, NULL, 0,
+									buff_uchar, len_uchar, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	*result = palloc(len_result + 1);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, *result, len_result + 1,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, *result,
+									len_result + 1, buff_uchar,
+									len_uchar, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return len_result;
 }
@@ -2684,7 +2834,8 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  */
 pg_attribute_unused()
 static void
-icu_set_collation_attributes(UCollator *collator, const char *loc)
+icu_set_collation_attributes(pg_icu_library *iculib, UCollator *collator,
+							 const char *loc)
 {
 	char	   *str = asc_tolower(loc, strlen(loc));
 
@@ -2758,7 +2909,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 				status = U_ILLEGAL_ARGUMENT_ERROR;
 
 			if (status == U_ZERO_ERROR)
-				ucol_setAttribute(collator, uattr, uvalue, &status);
+				iculib->setAttribute(collator, uattr, uvalue, &status);
 
 			/*
 			 * Pretend the error came from ucol_open(), for consistent error
@@ -2767,7 +2918,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			if (U_FAILURE(status))
 				ereport(ERROR,
 						(errmsg("could not open collator for locale \"%s\": %s",
-								loc, u_errorName(status))));
+								loc, iculib->errorName(status))));
 		}
 	}
 }
@@ -2783,17 +2934,18 @@ check_icu_locale(const char *icu_locale)
 #ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	pg_icu_library *iculib = get_icu_library(InvalidOid, icu_locale, NULL);
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = iculib->openCollator(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, iculib->errorName(status))));
 
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+		icu_set_collation_attributes(iculib, collator, icu_locale);
+	iculib->closeCollator(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
index 33465ad92d..b600604585 100644
--- a/src/include/utils/pg_locale_internal.h
+++ b/src/include/utils/pg_locale_internal.h
@@ -14,6 +14,8 @@
 #define _PG_LOCALE_INTERNAL_
 
 #ifdef USE_ICU
+#include <unicode/ubrk.h>
+#include <unicode/ucnv.h>
 #include <unicode/ucol.h>
 #endif
 
@@ -30,6 +32,108 @@
 #endif
 #endif
 
+#ifdef USE_ICU
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	int			minor_version;
+	char	   *libicui18n_name;
+	char	   *libicuuc_name;
+	void		(*getICUVersion) (UVersionInfo info);
+	void		(*getUnicodeVersion) (UVersionInfo info);
+	void		(*getCLDRVersion) (UVersionInfo info, UErrorCode *status);
+	UCollator  *(*openCollator) (const char *loc, UErrorCode *status);
+	void		(*closeCollator) (UCollator *coll);
+	void		(*getCollatorVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*getUCAVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+	UCollationResult (*strcoll) (const UCollator *coll,
+								 const UChar *source,
+								 int32_t sourceLength,
+								 const UChar *target,
+								 int32_t targetLength);
+	UCollationResult (*strcollUTF8) (const UCollator *coll,
+									 const char *source,
+									 int32_t sourceLength,
+									 const char *target,
+									 int32_t targetLength,
+									 UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	void		(*setAttribute) (UCollator *coll,
+								 UColAttribute attr,
+								 UColAttributeValue value,
+								 UErrorCode *status);
+	UConverter *(*openConverter) (const char *converterName,
+								  UErrorCode *err);
+	void		(*closeConverter) (UConverter *converter);
+	int32_t		(*fromUChars) (UConverter *cnv,
+							   char *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UErrorCode *pErrorCode);
+	int32_t		(*toUChars) (UConverter *cnv,
+							 UChar *dest,
+							 int32_t destCapacity,
+							 const char *src,
+							 int32_t srcLength,
+							 UErrorCode *pErrorCode);
+	int32_t		(*toLanguageTag) (const char *localeID,
+								  char *langtag,
+								  int32_t langtagCapacity,
+								  UBool strict,
+								  UErrorCode *err);
+	int32_t		(*getDisplayName) (const char *localeID,
+								   const char *inLocaleID,
+								   UChar *result,
+								   int32_t maxResultSize,
+								   UErrorCode *err);
+	int32_t		(*countAvailable) (void);
+	const char *(*getAvailable) (int32_t n);
+} pg_icu_library;
+
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+
+#endif
+
 /*
  * We define our own wrapper around locale_t so we can keep the same
  * function signatures for all builds, while not having to create a
@@ -53,7 +157,8 @@ struct pg_locale_struct
 #ifdef USE_ICU
 		struct
 		{
-			UCollator	*ucol;
+			UCollator		*ucol;
+			pg_icu_library	*lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
@@ -61,8 +166,19 @@ struct pg_locale_struct
 };
 
 #ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
+
+typedef pg_icu_library *(*get_icu_library_hook_type)(
+	Oid collid, const char *locale, const char *version);
+
+extern PGDLLIMPORT get_icu_library_hook_type get_icu_library_hook;
+
+extern pg_icu_library *get_default_icu_library(void);
+extern pg_icu_library *get_builtin_icu_library(void);
+extern int32_t icu_to_uchar(pg_icu_library *lib, UChar **buff_uchar,
+							const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(pg_icu_library *lib, char **result,
+							  const UChar *buff_uchar, int32_t len_uchar);
+
+#endif							/* USE_ICU */
 
 #endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1

#140

vignesh C

vignesh21@gmail.com

almost 2 years ago

In reply to: Jeff Davis (#139)

Re: Collation version tracking for macOS

On Sat, 21 Jan 2023 at 02:24, Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2023-01-19 at 00:11 -0800, Jeff Davis wrote:

Attached are a new set of patches, including a major enhancement: the
icu_multilib contrib module.

Attached rebased v8.

[ It looks like my email client truncated the last email somehow, in
case someone was wondering why it just stopped. ]

The big change is the introduction of the icu_multilib contrib module
which provides a lot of the functionality requested in this thread:

* icu version stability, which allows you to "lock down" ICU to a
specific major and minor version (or major version only)
* multi-lib ICU, which (if a GUC is set) will enable the "search by
collversion" behavior. Some doubts were raised about the wisdom of this
approach, but it's the only multi-lib solution we have without doing
some significant catalog work.

I rendered the HTML docs for icu_multilib and attached to this email to
make it easier to view.

icu_multilib assumes that the various ICU library versions are already
available in a single location, most likely installed with a package
manager. That location can be the same as the built-in ICU, or a
different location. Ideally, packagers would start to offer a few
"stable" versions of ICU that would be available for a long time, but
it will take a while for that to happen. So for now, it's up to the
user to figure out how to get the right versions of ICU on their system
and keep them there.

Automated tests of icu_multilib are a problem unless the one running
the tests is willing to compile the right versions of ICU (like I did).
But I at least have automated tests for the hooks by using the test
module test_collator_lib_hooks.

The v7 patches in this thread are dependent on the pure refactoring
patches in this CF entry:

https://commitfest.postgresql.org/41/3935/

/messages/by-id/052a5ed874d110be2f3ae28752e363306b10966d.camel@j-davis.com

The requested functionality _not_ offered by icu_multilib is tying a
specific collation to a specific ICU version. A few variants were
proposed, the latest is to tie a collation to the library file itself
through the provider. That needs to be done with proper catalog support
in core. But I believe the work I've done here has made a lot of
progress in that direction, and also shows the versatility of the new
hook to solve at least some problems.

This thread has been idle for a year now; it stalled after a lot
of discussion.
@Jeff Davis: Do you want to try to restart the discussion by posting
an updated version and see what happens?

Regards,
Vignesh

#141

Jeff Davis

pgsql@j-davis.com

almost 2 years ago

In reply to: vignesh C (#140)

Re: Collation version tracking for macOS

On Sat, 2024-01-20 at 07:40 +0530, vignesh C wrote:

This thread has been idle for a year now; it stalled after a lot
of discussion.
@Jeff Davis: Do you want to try to restart the discussion by posting
an updated version and see what happens?

Thank you for following up. Yes, I'd like to find a path forward here,
but I need some validation from others on my approach.

I rendered the docs I wrote as an HTML page and attached it to this
thread, to make it easier for others to read and comment. It's
basically a tool for experts who are willing to devote effort to
managing their collations and ICU libraries. Is that what we want?

At an implementation level, did I get the extension APIs right? I
considered making the API simpler, but that would require the extension
to do quite a bit more work (including a lot of redundant work) to use
ICU properly.

Regards,
Jeff Davis

#142

vignesh C

vignesh21@gmail.com

almost 2 years ago

In reply to: Jeff Davis (#141)

Re: Collation version tracking for macOS

On Mon, 22 Jan 2024 at 00:28, Jeff Davis <pgsql@j-davis.com> wrote:

On Sat, 2024-01-20 at 07:40 +0530, vignesh C wrote:

This thread has been idle for a year now; it stalled after a lot
of discussion.
@Jeff Davis: Do you want to try to restart the discussion by posting
an updated version and see what happens?

Thank you for following up. Yes, I'd like to find a path forward here,
but I need some validation from others on my approach.

Let's start by posting a rebased version to fix the CFBot patch apply
issue shown in [1]:

=== Applying patches on top of PostgreSQL commit ID
402388946fb3ac54f0fd5944d7e177ef7737eab2 ===
=== applying patch
./v8-0001-Support-multiple-ICU-collation-provider-libraries.patch
patching file src/backend/commands/collationcmds.c
Hunk #1 FAILED at 566.
....
1 out of 4 hunks FAILED -- saving rejects to file
src/backend/commands/collationcmds.c.rej
patching file src/backend/utils/adt/formatting.c
Hunk #1 succeeded at 1575 (offset 9 lines).
Hunk #2 succeeded at 1587 (offset 9 lines).
Hunk #3 succeeded at 1605 (offset 9 lines).
Hunk #4 succeeded at 1700 (offset 3 lines).
Hunk #5 succeeded at 1819 (offset -1 lines).
Hunk #6 succeeded at 1939 (offset -5 lines).
patching file src/backend/utils/adt/pg_locale.c
Hunk #1 FAILED at 70.
...
Hunk #31 FAILED at 2886.
Hunk #32 FAILED at 2902.
22 out of 32 hunks FAILED -- saving rejects to file
src/backend/utils/adt/pg_locale.c.rej

[1]: http://cfbot.cputube.org/patch_46_3956.log

Regards,
Vignesh

#143

Robert Haas

robertmhaas@gmail.com

almost 2 years ago

In reply to: Jeff Davis (#141)

Re: Collation version tracking for macOS

On Sun, Jan 21, 2024 at 1:58 PM Jeff Davis <pgsql@j-davis.com> wrote:

I rendered the docs I wrote as an HTML page and attached it to this
thread, to make it easier for others to read and comment. It's
basically a tool for experts who are willing to devote effort to
managing their collations and ICU libraries. Is that what we want?

At an implementation level, did I get the extension APIs right? I
considered making the API simpler, but that would require the extension
to do quite a bit more work (including a lot of redundant work) to use
ICU properly.

Not that I'm the most qualified person to have an opinion on this
topic, but did you intend to attach this stuff to this email, or is it
somewhere else?

--
Robert Haas
EDB: http://www.enterprisedb.com

#144

Jeff Davis

pgsql@j-davis.com

almost 2 years ago

In reply to: Robert Haas (#143)

Re: Collation version tracking for macOS

On Thu, 2024-02-01 at 15:58 -0500, Robert Haas wrote:

Not that I'm the most qualified person to have an opinion on this
topic, but did you intend to attach this stuff to this email, or is it
somewhere else?

The previous patch is here:

/messages/by-id/6f4a8c01a5cb1edf3a07d204c371fbddaef252f9.camel@j-davis.com

And I attached the rendered HTML doc page, which conveniently renders
in the archives (thanks to web team -- I didn't know if that would
actually work until I tried it):

/messages/by-id/attachment/142818/icu-multilib.html

For anyone interested in this work, the docs are the best place to
start.

I'm hesitant to put much more work into it (e.g. new patches, etc.)
without more feedback. Your opinion would certainly be valuable -- for
instance, when reading the docs, can you imagine yourself actually
using this if you ran into a collation versioning/migration problem?

Regards,
Jeff Davis

#145

Robert Haas

robertmhaas@gmail.com

almost 2 years ago

In reply to: Jeff Davis (#144)

Re: Collation version tracking for macOS

On Sun, Feb 4, 2024 at 10:42 PM Jeff Davis <pgsql@j-davis.com> wrote:

I'm hesitant to put much more work into it (e.g. new patches, etc.)
without more feedback. Your opinion would certainly be valuable -- for
instance, when reading the docs, can you imagine yourself actually
using this if you ran into a collation versioning/migration problem?

I'm having some difficulty understanding what the docs are trying to
tell me. I think there are some issues with ordering and pacing.

"The icu_multilib module provides control over the version (or
versions) of the ICU provider library used by PostgreSQL, which can be
different from the version of ICU with which it was built. Collations
are a product of natural language, and natural language evolves over
time; but PostgreSQL depends on stable ordering for structures such as
indexes. Newer versions of ICU update the provided collators to adapt
to changes in natural language, so it's important to control when and
how those new versions of ICU are used to prevent problems such as
index corruption."

Check. So far, so good.

"This module assumes that the necessary versions of ICU are already
available, such as through the operating system's package manager; and
already properly installed in a single location accessible to
PostgreSQL. The configuration variable icu_multilib.library_path should
be set to the location where these ICU library versions are
installed."

Here I feel we've skipped a few steps. I suggest postponing all
discussion of specific GUCs to a later point -- specifically the
configuration parameters section, which I think should actually be
F.19.1, with the use cases following that rather than preceding it. In
this introductory section, I suggest elaborating a bit more on what
problem we're trying to solve at a conceptual level. It feels like
we've gone straight from the very general issue (collation definitions
need to be stable but language isn't) to very specific (here's a GUC
that you can set to a pathname). I feel like the need for this module
should be more specifically motivated. Maybe something like:

1. Here's what we think your OS package manager is probably going to do.
2. That's going to interact with PostgreSQL in this way that I will
now describe.
3. See, that sucks, because of the stuff I said above about needing
stable collations!
4. But if you installed this module instead, then you could prevent
the things I said under #2 from happening.
5. Instead, you'd get this other behavior, which would make you happy.

I feel like I can almost piece together in my head how this is
supposed to work -- I think it's like "we expect the OS package
manager to drop all the ICU versions in the same directory via side by
side installs, and that works well for other programs because ... for
some mysterious reason they can latch onto the specific version they
were linked against ... but we can't or don't do that because ... I
guess we're dumber than those other pieces of software or
something???? ... so this module lets you ask for more sensible
behavior." But I think that could be spelled out a bit more clearly
and directly than this document seems to me to do.

I also wonder if we should be explaining why we don't get this right
out of the box. Like, if the normal behavior categorically sucks, why
do you have to install icu_multilib to get something else? Why not
make the multilib treatment the default? And if the normal behavior is
better for some cases and the icu_multilib behavior is better for
other cases, then maybe we ought to explain which one to use in which
scenario.

"icu_multilib must be loaded via shared_preload_libraries.
icu_multilib ignores any ICU library with a major version greater than
that with which PostgreSQL was built."

It's not clear from reading this whether the second sentence here is a
regrettable implementation restriction or design behavior. If it's
design behavior, what's the point of it?

--
Robert Haas
EDB: http://www.enterprisedb.com

#146

Jeff Davis

pgsql@j-davis.com

almost 2 years ago

In reply to: Robert Haas (#145)

Re: Collation version tracking for macOS

On Sun, 2024-02-11 at 22:04 +0530, Robert Haas wrote:

1. Here's what we think your OS package manager is probably going to do.
2. That's going to interact with PostgreSQL in this way that I will
now describe.
3. See, that sucks, because of the stuff I said above about needing
stable collations!
4. But if you installed this module instead, then you could prevent
the things I said under #2 from happening.
5. Instead, you'd get this other behavior, which would make you happy.

I like that framing, thank you. I'll try to come up with something
there.

I feel like I can almost piece together in my head how this is
supposed to work -- I think it's like "we expect the OS package
manager to drop all the ICU versions in the same directory via side by
side installs, and that works well for other programs because ... for
some mysterious reason they can latch onto the specific version they
were linked against ... but we can't or don't do that because ... I
guess we're dumber than those other pieces of software or
something???? ...

Postgres can and does latch on to the version of ICU it was compiled
against. It's a normal shared library dependency.

The problem is that databases -- and the file structures -- outlive a
particular version of Postgres. So if Postgres 16 is compiled against
ICU X and Postgres 17 is compiled against ICU Y, how do you upgrade
from 16 to 17? Postgres 17 will try to access the old file structures
using ICU Y, and they'll be corrupt.

What we want is the file structures that depend on ICU X to continue to
find ICU X even after you upgrade to Postgres 17, yet allow new
structures to be created using ICU Y. In other words, "multi-lib",
meaning that the same Postgres binary is linking to multiple versions
of ICU and using the different versions for different structures. That would
allow users to recreate one index at a time to use ICU Y, until nothing
depends on ICU X any longer.
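
As a rough illustration of the "multi-lib" idea above (not code from the
patch), the sketch below picks a library based on the collation version
that was recorded when a structure was built. Everything in it is a
simplifying assumption: icu_lib and choose_icu_lib() are hypothetical
names, each library is reduced to a single version string, and the
strings "collver-X" and "collver-Y" are placeholders rather than real ICU
collator versions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for one loaded ICU library. */
typedef struct icu_lib
{
	int			major_version;
	const char *collversion;	/* collator version this library reports */
} icu_lib;

/*
 * Choose the library to use for a structure.
 *
 * recorded == NULL means a new structure with no recorded version, so the
 * built-in (linked) library is used.  Otherwise return the library whose
 * reported version matches the recorded one, or NULL if none matches, so
 * that the caller can decide whether to warn or raise an error.
 */
static const icu_lib *
choose_icu_lib(const icu_lib *libs, size_t nlibs,
			   const icu_lib *builtin, const char *recorded)
{
	size_t		i;

	assert(builtin != NULL);

	if (recorded == NULL)
		return builtin;

	for (i = 0; i < nlibs; i++)
	{
		if (strcmp(libs[i].collversion, recorded) == 0)
			return &libs[i];
	}

	return NULL;
}
```

With libraries X and Y both available, an index recorded with
"collver-X" keeps resolving to X after the upgrade, while anything newly
created resolves to the built-in Y; once an index has been rebuilt, its
recorded version would match Y instead.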

I should say this is not an easy process even if something like
icu_multilib is available. We don't have all of the information needed
in the catalog to track which structures depend on which versions of a
collation library, collation library version is itself not easy to
define, and it still involves rebuilding (or at least re-validating) a
lot of structures. This is a "make hard things possible" tool, and I
suspect only a handful of users would use it successfully to migrate to
new ICU versions.

More simply, some users might just want to lock down the version of ICU
to X, and just use that forever until they have a reason to change it.
icu_multilib can also facilitate that, though it's still not trivial.

"icu_multilib must be loaded via shared_preload_libraries.
icu_multilib ignores any ICU library with a major version greater than
that with which PostgreSQL was built."

It's not clear from reading this whether the second sentence here is a
regrettable implementation restriction or design behavior. If it's
design behavior, what's the point of it?

That restriction came from Thomas's (uncommitted) work on the same
problem. I believe the reasoning was that we don't know whether future
versions of ICU might break something that we're doing, though perhaps
there's a better way.

Regards,
Jeff Davis

#147

Robert Haas

robertmhaas@gmail.com

almost 2 years ago

In reply to: Jeff Davis (#146)

Re: Collation version tracking for macOS

On Tue, Feb 13, 2024 at 1:55 AM Jeff Davis <pgsql@j-davis.com> wrote:

Postgres can and does latch on to the version of ICU it was compiled
against. It's a normal shared library dependency.

The problem is that databases -- and the file structures -- outlive a
particular version of Postgres. So if Postgres 16 is compiled against
ICU X and Postgres 17 is compiled against ICU Y, how do you upgrade
from 16 to 17? Postgres 17 will try to access the old file structures
using ICU Y, and they'll be corrupt.

What we want is the file structures that depend on ICU X to continue to
find ICU X even after you upgrade to Postgres 17, yet allow new
structures to be created using ICU Y. In other words, "multi-lib",
meaning that the same Postgres binary is linking to multiple versions
of ICU and using the different versions for different structures. That would
allow users to recreate one index at a time to use ICU Y, until nothing
depends on ICU X any longer.

Ah, I see. At least, I think I do. I think some of this material could
be very usefully included into the first section of the doc you're
trying to write. What you say here makes it a lot easier to grasp the
motivation and use case for this code, at least for me.

--
Robert Haas
EDB: http://www.enterprisedb.com

#148

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Jeff Davis (#146)

Re: Collation version tracking for macOS

On Tue, Feb 13, 2024 at 9:25 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Sun, 2024-02-11 at 22:04 +0530, Robert Haas wrote:

"icu_multilib must be loaded via shared_preload_libraries.
icu_multilib ignores any ICU library with a major version greater than
that with which PostgreSQL was built."

It's not clear from reading this whether the second sentence here is a
regrettable implementation restriction or design behavior. If it's
design behavior, what's the point of it?

That restriction came from Thomas's (uncommitted) work on the same
problem. I believe the reasoning was that we don't know whether future
versions of ICU might break something that we're doing, though perhaps
there's a better way.

Right, to spell that out more fully: We compile and link against one
particular ICU library that is present at compile time, and there is a
place in that multi-lib patch that assigns the function pointers from
that version to variables of the function pointer type that we expect.
Compilation would fail if ICU ever changed relevant function
prototypes in a future release, and then we'd have to come up with
some trampoline/wrapper scheme to wallpaper over differences. That's
why I think it's safe to use dlsym() to access function pointers for
versions up to and including the one whose headers we were compiled
against, but not later ones which we haven't tested in that way.
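
The rule described above could be sketched roughly like this. It is an
illustration only, not code from the multi-lib patch: BUILD_ICU_MAJOR,
icu_symbol_name() and load_icu_symbol() are hypothetical names, and the
sketch assumes ICU's default symbol renaming, where exported functions
carry a major-version suffix such as ucol_open_68.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: major version of the ICU headers we were compiled against. */
#define BUILD_ICU_MAJOR 72

/*
 * Build a versioned symbol name, e.g. "ucol_open" and 68 -> "ucol_open_68".
 * Returns 0 on success, -1 if the name does not fit in the buffer.
 */
static int
icu_symbol_name(char *buf, size_t bufsize, const char *base, int major)
{
	int			n;

	assert(buf != NULL && base != NULL);

	n = snprintf(buf, bufsize, "%s_%d", base, major);
	return (n < 0 || (size_t) n >= bufsize) ? -1 : 0;
}

/*
 * Look up one function in an ICU library that was already opened with
 * dlopen().  Versions newer than the build-time headers are refused,
 * because their prototypes were never checked by the compiler.
 * Returns NULL if the version is too new or the symbol is not found.
 */
static void *
load_icu_symbol(void *handle, const char *base, int major)
{
	char		name[128];

	if (major > BUILD_ICU_MAJOR)
		return NULL;
	if (handle == NULL ||
		icu_symbol_name(name, sizeof(name), base, major) != 0)
		return NULL;

	return dlsym(handle, name);
}
```

A caller would first dlopen() a versioned library file (for example
libicui18n.so.68 on Linux) and then cast the returned pointer to the
function pointer type taken from the build-time headers. That cast is
exactly the step the compiler cannot verify for newer versions.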

Sadly I won't be able to work on multi-lib ICU support again in this
cycle. I think we managed to prove that dlopen works for this, and
learn some really interesting stuff about Unicode and ICU evolution,
but we still have to come up with the right model, catalogues and DDL
etc, for a nice user experience. What I was most recently
experimenting with based on earlier discussions was the idea of
declaring separate providers: icu72 and icu68 could both exist and you
could create extra indexes and then drop the originals as a
no-downtime upgrade path. I have a pet theory that you could usefully
support multi-version libc locales too if you're prepared to make
certain assumptions (short version: take the collation definition
files from any older version of your OS, compile with newer version's
localedef, give it a name like "en_US@ubuntu18", and assume/pray they
didn't change stuff that wasn't expressed in the definition file), so
I was working on a generalisation slightly wider than just
multi-version ICU.