Update Unicode data to Unicode 16.0.0

Started by Peter Eisentraut over 1 year ago (69 messages)
#1Peter Eisentraut
peter_e@gmx.net

Here is the patch to update the Unicode data to version 16.0.0.

Normally, this would have been routine, but a few months ago there was
some debate about how this should be handled. [0] AFAICT, the consensus
was to go ahead with it, but I just wanted to notify it here to be clear.

[0]: /messages/by-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com

Attachments:

0001-Update-Unicode-data-to-Unicode-16.0.0.patch (text/plain; charset=UTF-8) +4370 -3946

#2Joe Conway
mail@joeconway.com
In reply to: Peter Eisentraut (#1)
Re: Update Unicode data to Unicode 16.0.0

On 11/11/24 01:27, Peter Eisentraut wrote:

Here is the patch to update the Unicode data to version 16.0.0.

Normally, this would have been routine, but a few months ago there was
some debate about how this should be handled. [0] AFAICT, the consensus
was to go ahead with it, but I just wanted to notify it here to be clear.

[0]:
/messages/by-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com

I ran a check and found that this patch causes changes in upper casing
of some characters. Repro:

setup
8<-------------
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll
LOCALE_PROVIDER builtin
BUILTIN_LOCALE 'C.UTF-8'
TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-------------

8<-------------
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
7ec7f5c2d8729ec960942942bb82aedd
(1 row)

builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)

builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
8cf65a43affc221f3a20645ef402085e
(1 row)
8<-------------

8<-------------
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
7ec7f5c2d8729ec960942942bb82aedd
(1 row)

Time: 19858.981 ms (00:19.859)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
3055b3d5dff76c8c1250ef500c6ec13f
(1 row)

Time: 19774.467 ms (00:19.774)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
9985acddf7902ea603897cdaccd02114
(1 row)
8<-------------

So both UPPER and INITCAP produce different results unless I am missing
something.
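
The same kind of check can be sketched outside the database. The following is a minimal illustration in Python, not the procedure used above: `case_digest` is a hypothetical helper, and Python's `str.upper` and `str.lower` apply full case mappings from the Unicode version bundled with the interpreter, so the digests will not match the ones from the builtin provider. It only shows the technique: transform every string, sort in code point order, concatenate, and hash, so that any change in a case mapping changes the digest.

```python
import hashlib
import unicodedata


def case_digest(strings, transform):
    """MD5 over the sorted, concatenated results of transform(s).

    Mirrors md5(string_agg(...)) over rows ordered by the transformed
    value; Python sorts str by code point, like a C-style collation.
    """
    mapped = sorted(transform(s) for s in strings)
    return hashlib.md5("".join(mapped).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical sample data; the real test used a much larger file.
    sample = ["abc", "Straße", "ǆ"]
    print("Unicode data version:", unicodedata.unidata_version)
    print("lower:", case_digest(sample, str.lower))
    print("upper:", case_digest(sample, str.upper))
```

Running the script under two interpreters with different `unicodedata.unidata_version` values and comparing the printed digests would be the analogue of comparing master against master+patch.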

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#3Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Joe Conway (#2)
Re: Update Unicode data to Unicode 16.0.0

On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:

On 11/11/24 01:27, Peter Eisentraut wrote:

Here is the patch to update the Unicode data to version 16.0.0.

Normally, this would have been routine, but a few months ago there was
some debate about how this should be handled. [0]  AFAICT, the consensus
was to go ahead with it, but I just wanted to notify it here to be clear.

[0]:
/messages/by-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com

I ran a check and found that this patch causes changes in upper casing
of some characters.

I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much against it.

From my experiences in the field, I consider this need to rebuild indexes
one of the greatest current problems for the usability of PostgreSQL.
I dare say that most people would prefer living with an outdated Unicode version.

Yours,
Laurenz Albe

#4Peter Eisentraut
peter_e@gmx.net
In reply to: Laurenz Albe (#3)
Re: Update Unicode data to Unicode 16.0.0

On 12.11.24 10:40, Laurenz Albe wrote:

On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:

On 11/11/24 01:27, Peter Eisentraut wrote:

Here is the patch to update the Unicode data to version 16.0.0.

Normally, this would have been routine, but a few months ago there was
some debate about how this should be handled. [0]  AFAICT, the consensus
was to go ahead with it, but I just wanted to notify it here to be clear.

[0]:
/messages/by-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com

I ran a check and found that this patch causes changes in upper casing
of some characters.

I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much against it.

The practice of regularly updating the Unicode files is older than the
builtin collation provider. It is similar to updating the time zone
files, the encoding conversion files, the snowball files, etc. We need
to move all of these things forward to keep up with the aspects of the
real world that this data reflects. New features are required to live
in that environment. If a new feature were proposed that would then
require us to stop updating any of these files, we would likely not
accept that, or at least need a very deliberate discussion about that
before the feature is introduced. This was not done here at all. If
this new feature has this hidden requirement, then that feature is not
complete yet, and work should probably continue to make that feature
complete. But that can't take progress in other areas hostage.

#5Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#1)
Re: Update Unicode data to Unicode 16.0.0

On Mon, Nov 11, 2024 at 07:27:53AM +0100, Peter Eisentraut wrote:

Here is the patch to update the Unicode data to version 16.0.0.

Normally, this would have been routine, but a few months ago there was some
debate about how this should be handled. [0] AFAICT, the consensus was to
go ahead with it, but I just wanted to notify it here to be clear.

[0]: /messages/by-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com

Worth noting that unaccent.rules is unchanged after switching to
16.0.0:
cd contrib/unaccent && make update-unicode
--
Michael

#6Jeff Davis
pgsql@j-davis.com
In reply to: Laurenz Albe (#3)
Re: Update Unicode data to Unicode 16.0.0

On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:

I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much
against it.

How would you feel if there was a better way to "lock down" the
behavior using an extension?

I have a patchset here:

/messages/by-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel@j-davis.com

that changes the implementation of collation and ctype to use method
tables rather than branching, and it also introduces some hooks that
can be used to replace the method tables with whatever you want.
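
As a rough illustration of the idea (method tables plus a hook that can swap them out), here is a minimal sketch in Python. Every name in it (`CtypeMethods`, `set_methods_hook`, `to_upper`, `to_lower`) is hypothetical and chosen for the example; it does not reflect the structure or names in the patchset.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CtypeMethods:
    """Hypothetical method table: one slot per operation."""
    lower: Callable[[str], str]
    upper: Callable[[str], str]


# Default table, standing in for a provider's built-in behavior.
DEFAULT_METHODS = CtypeMethods(lower=str.lower, upper=str.upper)

# Hypothetical hook slot; kept in a dict so it can be replaced at run time.
_hook: dict = {"fn": None}


def set_methods_hook(fn: Optional[Callable[[CtypeMethods], CtypeMethods]]) -> None:
    """Install (or clear, with None) a function that may replace the table."""
    _hook["fn"] = fn


def get_methods() -> CtypeMethods:
    """Return the active table: the hook's result if set, else the default."""
    fn = _hook["fn"]
    return fn(DEFAULT_METHODS) if fn is not None else DEFAULT_METHODS


def to_upper(s: str) -> str:
    # Callers dispatch through the table instead of branching on provider.
    return get_methods().upper(s)


def to_lower(s: str) -> str:
    return get_methods().lower(s)
```

An extension in this picture would call something like `set_methods_hook` once at load time and return a table built from whichever data version it wants to pin.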

Regards,
Jeff Davis

#7Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Jeff Davis (#6)
Re: Update Unicode data to Unicode 16.0.0

On Tue, 2024-11-19 at 13:42 -0800, Jeff Davis wrote:

On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:

I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much
against it.

How would you feel if there was a better way to "lock down" the
behavior using an extension?

Better.

I have a patchset here:

/messages/by-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel@j-davis.com

that changes the implementation of collation and ctype to use method
tables rather than branching, and it also introduces some hooks that
can be used to replace the method tables with whatever you want.

That looks like a nice idea, since it obviates the need to build
PostgreSQL yourself if you want to use a non-standard copy of - say -
the ICU library. You still have to build your own ICU library, though.

I had hoped that the builtin provider would remove the need to REINDEX,
but I have given up that hope. Peter's argument is sound from a
conceptual point of view, even though I doubt that the average user
will be able to appreciate it.

Yours,
Laurenz Albe

#8Jeff Davis
pgsql@j-davis.com
In reply to: Laurenz Albe (#7)
Re: Update Unicode data to Unicode 16.0.0

On Wed, 2024-11-20 at 06:41 +0100, Laurenz Albe wrote:

That looks like a nice idea, since it obviates the need to build
PostgreSQL yourself if you want to use a non-standard copy of - say -
the ICU library.  You still have to build your own ICU library,
though.

It would work with the builtin provider, too, which would not require
ICU at all.

The idea is that you could build an extension that copies the same
logic for building the Unicode tables that we have in Postgres now,
except that it uses whatever version of the Unicode data files you
want.

If we want it to be targeted more specifically at the builtin provider,
we can make it even simpler by allowing you to just replace the unicode
tables with an extension (rather than the method tables). I'm not 100%
sure what people actually want here, so I'm open to suggestion.

I had hoped that the builtin provider would remove the need to
REINDEX,
but I have given up that hope.  Peter's argument is sound from a
conceptual point of view, even though I doubt that the average user
will be able to appreciate it.

I'd like to provide options for all kinds of users and packagers.

Regards,
Jeff Davis

#9Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#4)
Re: Update Unicode data to Unicode 16.0.0

On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:

The practice of regularly updating the Unicode files is older than
the
builtin collation provider.  It is similar to updating the time zone
files, the encoding conversion files, the snowball files, etc.  We
need
to move all of these things forward to keep up with the aspects of
the
real world that this data reflects.

Should we consider bundling multiple versions of the generated tables
(header files) along with Postgres?

That would enable a compile-time option to build with an older version
of Unicode if you want, solving the packager concern that Noah raised.
It would also make it easier for people to coordinate the Postgres
version of Unicode and the ICU version of Unicode.

Regards,
Jeff Davis

#10Jeremy Schneider
schneider@ardentperf.com
In reply to: Jeff Davis (#9)
Re: Update Unicode data to Unicode 16.0.0

On Mon, 20 Jan 2025 13:39:35 -0800
Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:

The practice of regularly updating the Unicode files is older than
the
builtin collation provider.  It is similar to updating the time
zone files, the encoding conversion files, the snowball files, etc.
We need
to move all of these things forward to keep up with the aspects of
the
real world that this data reflects.

Should we consider bundling multiple versions of the generated tables
(header files) along with Postgres?

That would enable a compile-time option to build with an older version
of Unicode if you want, solving the packager concern that Noah raised.
It would also make it easier for people to coordinate the Postgres
version of Unicode and the ICU version of Unicode.

FWIW, after adding ICU support I personally don't think there's a
pressing need to continue updating the tables anymore. I think ICU is
the best solution for people who need the latest linguistic collation
rules.

On the user side, my main concerns are the same as they've always
been: 100% confidence that Postgres updates will not corrupt any data
or cause incorrect query results, and not being forced to rebuild
everything (or logically copy data to avoid pg_upgrade). I'm at a large
company with many internal devs using Postgres in ways I don't know
about, and many users storing lots of unicode data I don't know about.

I'm working a fair bit with Docker and Kubernetes and CloudNativePG
now, so our builds come through the debian PGDG repo. Bundling multiple
tables doesn't bother me, as long as it's not a precursor to removing
current tables from the debian PGDG builds we consume in the future.

Ironically it's not really an issue yet for us on docker because
support for pg_upgrade is pretty limited at the moment. :) But I
think pg_upgrade support will rapidly improve in docker, and will
become common on large databases.

If Postgres does go the path of multiple tables, does the community
want to accumulate a new set of tables every year? That could add up
quickly. Maybe we don't add new tables every year, but follow the
examples of Oracle and DB2 in accumulating them on a less frequent
basis?

-Jeremy

#11Jeff Davis
pgsql@j-davis.com
In reply to: Jeremy Schneider (#10)
Re: Update Unicode data to Unicode 16.0.0

On Mon, 2025-01-20 at 17:06 -0800, Jeremy Schneider wrote:

FWIW, after adding ICU support I personally don't think there's a
pressing need to continue updating the tables anymore.

I agree that it's not a pressing concern.

If Postgres does go the path of multiple tables, does the community
want to accumulate a new set of tables every year? That could add up
quickly. Maybe we don't add new tables every year, but follow the
examples of Oracle and DB2 in accumulating them on a less frequent
basis?

Yeah, it would probably be every-other-release or something. By the
time we built up enough versions for someone to worry about, hopefully
we'd have some better systems in place to track versions and migrate
forward.

Regards,
Jeff Davis

#12Jeff Davis
pgsql@j-davis.com
In reply to: Jeremy Schneider (#10)
Re: Update Unicode data to Unicode 16.0.0

On Mon, 2025-01-20 at 17:06 -0800, Jeremy Schneider wrote:

On the user side, my main concerns are the same as they've always
been: 100% confidence that Postgres updates will not corrupt any data
or cause incorrect query results

I'll add that, while 100% may be a good goal, it hasn't been the
standard in the past. You're talking about a new standard of
immutability starting in 18, and as Peter pointed out, I don't think
Unicode updates are the only thing we need to consider.

My personal opinion is that both positions -- to upgrade Unicode or not
-- are a bit exaggerated. On the one hand, there's no urgency to
updating Unicode; but on the other hand, there's not a huge danger, at
least compared with our historical standards.

Regards,
Jeff Davis

#13Peter Eisentraut
peter_e@gmx.net
In reply to: Jeremy Schneider (#10)
Re: Update Unicode data to Unicode 16.0.0

On 21.01.25 02:06, Jeremy Schneider wrote:

FWIW, after adding ICU support I personally don't think there's a
pressing need to continue updating the tables anymore.

That appears to ignore what these tables are actually used for. They
are used for Unicode normalization, which is used by SCRAM. So in a
slightly hyperbolic sense, keeping these tables updated is
security-relevant. They are also used by psql to determine character
width and format output correctly.

Building a collation provider on this came much later. It was possibly
a mistake how that was done.

#14Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#9)
Re: Update Unicode data to Unicode 16.0.0

On 20.01.25 22:39, Jeff Davis wrote:

On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:

The practice of regularly updating the Unicode files is older than
the
builtin collation provider.  It is similar to updating the time zone
files, the encoding conversion files, the snowball files, etc.  We
need
to move all of these things forward to keep up with the aspects of
the
real world that this data reflects.

Should we consider bundling multiple versions of the generated tables
(header files) along with Postgres?

I wouldn't have a problem with that.

That would enable a compile-time option to build with an older version
of Unicode if you want, solving the packager concern that Noah raised.
It would also make it easier for people to coordinate the Postgres
version of Unicode and the ICU version of Unicode.

But I don't think it would be a compile-time decision. I think it would
be a run-time selection, similar to the theorized multiple-ICU-versions
feature. (Those two features might even go together, since a given ICU
version also sort of assumes a given Unicode version.)

#15Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#14)
Re: Update Unicode data to Unicode 16.0.0

On Wed, 2025-01-22 at 19:08 +0100, Peter Eisentraut wrote:

But I don't think it would be a compile-time decision.  I think it
would
be a run-time selection, similar to the theorized multiple-ICU-
versions
feature.  (Those two features might even go together, since a given
ICU
version also sort of assumes a given Unicode version.)

I am trying to get there, and the ctype methods patch is a step in that
direction, but I don't think we will have the full multi-library-versions
work in v18.

A compile-time option does have a chance for v18, and if that satisfies
the immediate concerns of packagers, then we can still update Unicode
in the default build.

Regards,
Jeff Davis

#16Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#13)
Re: Update Unicode data to Unicode 16.0.0

On Wed, 2025-01-22 at 19:03 +0100, Peter Eisentraut wrote:

Building a collation provider on this came much later.  It was
possibly
a mistake how that was done.

It wasn't a mistake. "Stability within a PG major version" was called a
*benefit* near the top of the first email on the subject[1]. It was
considered a benefit because it offered a level of stability that
neither libc nor ICU could offer. As far as I know, it's still
considered to be a benefit today by more people than not (e.g. [2]).

The concerns about Unicode updates come from a misunderstanding of the
level of stability offered in the past:

* IMMUTABLE was initially a planner concept[3], which is why it didn't
care much about dependence on GUCs for instance.

* Expression / predicate indexes rely on immutability to mean something
more strict, and for that, dependence on GUCs creates a problem[4].
(Also, partitioning.)

* It's hard to make an immutable UDF without a SET search_path clause,
but until version 17, that was such a huge performance hit that it was
not usable in an expression index. There will be a lot of not-truly-
immutable UDFs used in expression indexes for a long time.

* Ordinary text indexes rely on the collation libraries to be stable,
which is hard to control because they could be updated by the OS. Only
recently has it become barely possible to freeze the version of libc[5]
without freezing the whole OS version. And if you do manage to freeze
both libc and ICU, you risk missing security fixes.

* pg_upgrade implicitly relies on IMMUTABLE to mean something even more
strict: stability across major versions. That's a problem for
expression indexes on functions like NORMALIZE(). And, if using the
optional built-in provider, also a problem for expression indexes on
LOWER(), etc.

At each moment we took steps that made sense at the time and in context
and I am not criticizing any of those steps. The biggest practical
problem was unforeseen dramatic changes in glibc that broke a lot of
text indexes. The rest of the problems are a mix of design issues,
feature interactions, and implementation details that were not resolved
before the builtin provider existed and are still not resolved today.

I do not accept the premise that there is a problem with the built-in
provider. I didn't throw caution to the wind and neither did the
reviewers: you, Daniel, Jeremy, and I did a ton of work to understand,
mitigate, and document the risks (along with a lot of help from
Thomas's earlier work). Users who opt in to the built-in provider opt
in to occasional controlled changes according to the rather strict
Unicode stability policies[6]. These policies mitigate risks
dramatically, especially for those using only assigned code points,
which can be checked with the SQL function unicode_assigned().
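
For readers who want to try the "assigned code points only" test outside the database, here is a minimal Python analogue. It is a sketch, not the server's implementation: `all_assigned` is a hypothetical helper, and its answer depends on the Unicode version bundled with the Python interpreter (`unicodedata.unidata_version`), which may differ from the version PostgreSQL was built with.

```python
import unicodedata


def all_assigned(text: str) -> bool:
    """True if no code point in text has general category Cn (unassigned)."""
    return all(unicodedata.category(ch) != "Cn" for ch in text)


if __name__ == "__main__":
    print(unicodedata.unidata_version)
    print(all_assigned("abc"))
    # U+50000 lies in a plane with no assigned characters.
    print(all_assigned("abc\U00050000"))
```

A string that passes this kind of check contains only code points whose properties are covered by the stability policies; a string that fails it may change behavior when a later Unicode version assigns those code points.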

Regards,
Jeff Davis

[1]: /messages/by-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com

[2]: /messages/by-id/3729436.1721322211@sss.pgh.pa.us

[3]: /messages/by-id/3428810.1721160969@sss.pgh.pa.us

[4]:

CREATE TABLE t(f float4);
CREATE UNIQUE INDEX t_idx ON t((f::text));
SET extra_float_digits = 0;
INSERT INTO t VALUES (1.23456789);
INSERT INTO t VALUES (1.23456789); -- error
SET extra_float_digits = 1;
INSERT INTO t VALUES (1.23456789); -- success

[5]: https://github.com/awslabs/compat-collation-for-glibc

[6]: https://www.unicode.org/policies/stability_policy.html
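
The failure mode in [4] can be modeled without a database. The sketch below is a toy analogy in Python under stated assumptions: `UniqueExprIndex`, `settings`, and `float_to_text` are hypothetical names, the `digits` setting merely stands in for a GUC such as extra_float_digits, and a dict is nothing like a real B-tree. It shows why a unique expression index stops enforcing uniqueness once the "immutable" key function changes its output.

```python
class UniqueExprIndex:
    """Toy unique index keyed on key_fn(value)."""

    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.entries = {}

    def insert(self, value):
        key = self.key_fn(value)
        if key in self.entries:
            raise ValueError(f"duplicate key {key!r}")
        self.entries[key] = value


# Stand-in for a session setting that affects text output of floats.
settings = {"digits": 6}


def float_to_text(x):
    # Declared "immutable" in spirit, but the result depends on settings.
    return format(x, f".{settings['digits']}g")


if __name__ == "__main__":
    idx = UniqueExprIndex(float_to_text)
    idx.insert(1.23456789)
    try:
        idx.insert(1.23456789)          # rejected: same key as before
    except ValueError as e:
        print("error:", e)
    settings["digits"] = 8
    idx.insert(1.23456789)              # accepted: key text changed
    print(sorted(idx.entries))
```

The same reasoning applies when the key function is LOWER() or NORMALIZE() and the thing that changes is the Unicode data rather than a setting.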

#17Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#1)
Re: Update Unicode data to Unicode 16.0.0

On Mon, 2024-11-11 at 07:27 +0100, Peter Eisentraut wrote:

Here is the patch to update the Unicode data to version 16.0.0.

Normally, this would have been routine, but a few months ago there
was
some debate about how this should be handled. [0]  AFAICT, the
consensus
was to go ahead with it, but I just wanted to notify it here to be
clear.

We discussed $SUBJECT at the Developer Meeting before FOSDEM.

Those people who were most concerned about the Unicode updates on the
list were not present, so I don't consider the discussion to be
binding. But the attendees present agreed that:

(a) we should not block the update to Unicode indefinitely; and
(b) we should make reasonable attempts to mitigate potential
problems.

One idea for (b) resurfaced, which was to make a best-effort check at
pg_upgrade time for affected indexes. The check would not be
bulletproof, because we can't catch dependencies that are hidden inside
SPI (e.g. a plpgsql function that calls LOWER()), but it would catch
most potential problems.

Patch attached. A few notes:

* The dependency entries don't exist because LOWER(), etc., are
system objects (pinned); so it queries the indexprs, indpreds,
partexprs, and conbin.
* The query is large and perhaps too clever, but it seems to work. I
tried to add inline comments to the SQL, and pgindent had its own ideas
about how to format them -- suggestions welcome.
* We haven't actually done the Unicode update yet, so it will notice
that the PG17 and PG18 Unicode versions are the same, and return early.
Either apply on top of the Unicode update patch, or comment out the
early return for testing.
* It emits a warning rather than an error, so you need to specify
pg_upgrade with "-r" to see the output file.
* I didn't adapt the query to run on pre-17 versions, even though it
could find some potential problem cases (like an index on NORMALIZE()).
I can add that if someone thinks it's worthwhile.
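
To make the shape of such a best-effort check concrete, here is a much-simplified sketch in Python. It is not the attached patch, which is a SQL query over the stored expression columns; this version assumes the expressions are already available as plain text, and the function list and the names `UNICODE_FUNCS` and `flag_unicode_dependent` are hypothetical.

```python
import re

# Assumed list of Unicode-dependent functions, for illustration only.
UNICODE_FUNCS = ("lower", "upper", "initcap", "normalize")

# Match a listed function name used as a call, e.g. "lower(".
_PATTERN = re.compile(
    r"\b(" + "|".join(UNICODE_FUNCS) + r")\s*\(", re.IGNORECASE
)


def flag_unicode_dependent(relations):
    """Return names of relations whose expression text calls a listed function.

    relations: iterable of (name, expression_text or None).
    """
    return [name for name, expr in relations if expr and _PATTERN.search(expr)]


if __name__ == "__main__":
    rels = [
        ("idx_a", "lower(name)"),
        ("idx_b", "id + 1"),
        ("idx_c", "NORMALIZE(title)"),
        ("idx_d", "my_lower(x)"),   # a wrapper hides the dependency
    ]
    print(flag_unicode_dependent(rels))
```

The `idx_d` case illustrates the limitation described above: a user-defined wrapper that calls LOWER() internally is not detected by inspecting the expression alone.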

Regards,
Jeff Davis

Attachments:

v1-0001-Add-pg_upgrade-check-for-Unicode-dependent-relati.patch (text/x-patch; charset=UTF-8) +179 -1

#18Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#17)
Re: Update Unicode data to Unicode 16.0.0

On 05.02.25 22:47, Jeff Davis wrote:

(b) we should make reasonable attempts to mitigate potential
problems.

One idea for (b) resurfaced, which was to make a best-effort check at
pg_upgrade time for affected indexes. The check would not be
bulletproof, because we can't catch dependencies that are hidden inside
SPI (e.g. a plpgsql function that calls LOWER()), but it would catch
most potential problems.

Patch attached. A few notes:

* The dependency entries don't exist because LOWER(), etc., are
system objects (pinned); so it queries the indexprs, indpreds,
partexprs, and conbin.
* The query is large and perhaps too clever, but it seems to work. I
tried to add inline comments to the SQL, and pgindent had its own ideas
about how to format them -- suggestions welcome.
* We haven't actually done the Unicode update yet, so it will notice
that the PG17 and PG18 Unicode versions are the same, and return early.
Either apply on top of the Unicode update patch, or comment out the
early return for testing.
* It emits a warning rather than an error, so you need to specify
pg_upgrade with "-r" to see the output file.
* I didn't adapt the query to run on pre-17 versions, even though it
could find some potential problem cases (like an index on NORMALIZE()).
I can add that if someone thinks it's worthwhile.

This all looks quite reasonable to me. The code could obviously use a
bit closer inspection, but the way you've structured it looked quite
okay to me.

#19Jeff Davis
pgsql@j-davis.com
In reply to: Michael Paquier (#5)
Re: Update Unicode data to Unicode 16.0.0

On Mon, 2024-11-18 at 13:58 +0900, Michael Paquier wrote:

Worth noting that unaccent.rules is unchanged after switching to
16.0.0:
cd contrib/unaccent && make update-unicode

What diffs are you seeing? I don't see any diffs to unaccent.rules
since Unicode 14.0.0.

Aside: it looks like that target is not being run with the meson build.

Regards,
Jeff Davis

#20Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#18)
Re: Update Unicode data to Unicode 16.0.0

On Fri, 2025-02-14 at 09:02 +0100, Peter Eisentraut wrote:

This all looks quite reasonable to me.  The code could obviously use
a
bit closer inspection, but the way you've structured it looked quite
okay to me.

Attached a version that rebases both patches. In my patch, I added a
report_status().

Note that the Unicode update has a few test diffs for NORMALIZE(),
please check to see if the tests themselves need an update.

Regards,
Jeff Davis

Attachments:

v2-0001-Update-Unicode-data-to-Unicode-16.0.0.patch (text/x-patch; charset=UTF-8) +729 -446
v2-0002-Add-pg_upgrade-check-for-Unicode-dependent-relati.patch (text/x-patch; charset=UTF-8) +180 -1

#21Michael Paquier
michael@paquier.xyz
In reply to: Jeff Davis (#19)
#22Nathan Bossart
nathandbossart@gmail.com
In reply to: Jeff Davis (#20)
#23Jeff Davis
pgsql@j-davis.com
In reply to: Nathan Bossart (#22)
#24Nathan Bossart
nathandbossart@gmail.com
In reply to: Jeff Davis (#23)
#25Jeff Davis
pgsql@j-davis.com
In reply to: Nathan Bossart (#24)
#26Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#20)
#27Jeremy Schneider
schneider@ardentperf.com
In reply to: Jeff Davis (#25)
#28Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Jeremy Schneider (#27)
#29Joe Conway
mail@joeconway.com
In reply to: Laurenz Albe (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeremy Schneider (#27)
#31Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#30)
#32Jeff Davis
pgsql@j-davis.com
In reply to: Jeremy Schneider (#27)
#33Jeremy Schneider
schneider@ardentperf.com
In reply to: Jeff Davis (#31)
#34vignesh C
vignesh21@gmail.com
In reply to: Jeff Davis (#25)
#35Jeff Davis
pgsql@j-davis.com
In reply to: Jeremy Schneider (#33)
#36Jeff Davis
pgsql@j-davis.com
In reply to: vignesh C (#34)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#35)
#38Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#26)
#39Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#30)
#40vignesh C
vignesh21@gmail.com
In reply to: Jeff Davis (#36)
#41Jeff Davis
pgsql@j-davis.com
In reply to: Joe Conway (#29)
#42Jeremy Schneider
schneider@ardentperf.com
In reply to: Jeff Davis (#41)
#43Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#41)
#44Jeff Davis
pgsql@j-davis.com
In reply to: Jeremy Schneider (#42)
#45Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#44)
#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#45)
#47Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#46)
#48Joe Conway
mail@joeconway.com
In reply to: Robert Haas (#47)
#49Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#43)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#49)
#51Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#50)
#52Jeremy Schneider
schneider@ardentperf.com
In reply to: Jeff Davis (#51)
#53Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#51)
#54Robert Haas
robertmhaas@gmail.com
In reply to: Jeremy Schneider (#52)
#55Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#53)
#56Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#55)
#57Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#56)
#58Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#57)
#59Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#58)
#60Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#59)
#61Peter Eisentraut
peter_e@gmx.net
In reply to: Jeremy Schneider (#27)
#62Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#61)
#63Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#62)
#64Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#60)
#65Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Robert Haas (#63)
#66Jeremy Schneider
schneider@ardentperf.com
In reply to: Jeff Davis (#64)
#67Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#38)
#68Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#22)
#69Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#68)