Update Unicode data to Unicode 16.0.0
Here is the patch to update the Unicode data to version 16.0.0.
Normally, this would have been routine, but a few months ago there was
some debate about how this should be handled. [0] AFAICT, the consensus
was to go ahead with it, but I just wanted to notify it here to be clear.
[0]: /messages/by-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com
Attachments:
0001-Update-Unicode-data-to-Unicode-16.0.0.patch (text/plain; charset=UTF-8; +4370 -3946)
On 11/11/24 01:27, Peter Eisentraut wrote:
Here is the patch to update the Unicode data to version 16.0.0.
Normally, this would have been routine, but a few months ago there was
some debate about how this should be handled. [0] AFAICT, the consensus
was to go ahead with it, but I just wanted to notify it here to be clear.
[0]: /messages/by-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com
I ran a check and found that this patch causes changes in upper casing
of some characters. Repro:
setup
8<-------------
wget https://joeconway.com/presentations/formated-unicode.txt
initdb
psql
CREATE DATABASE builtincoll
LOCALE_PROVIDER builtin
BUILTIN_LOCALE 'C.UTF-8'
TEMPLATE template0;
\c builtincoll
CREATE TABLE unsorted_table(strings text);
\copy unsorted_table from formated-unicode.txt (format csv)
VACUUM FREEZE ANALYZE unsorted_table;
8<-------------
8<-------------
-- on master
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
7ec7f5c2d8729ec960942942bb82aedd
(1 row)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
97f83a4d1937aa65bcf8be134bf7b0c4
(1 row)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
8cf65a43affc221f3a20645ef402085e
(1 row)
8<-------------
8<-------------
-- master+patch
builtincoll=# WITH t AS (SELECT lower(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
7ec7f5c2d8729ec960942942bb82aedd
(1 row)
Time: 19858.981 ms (00:19.859)
builtincoll=# WITH t AS (SELECT upper(strings) AS s FROM unsorted_table
ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
3055b3d5dff76c8c1250ef500c6ec13f
(1 row)
Time: 19774.467 ms (00:19.774)
builtincoll=# WITH t AS (SELECT initcap(strings) AS s FROM
unsorted_table ORDER BY 1)
SELECT md5(string_agg(t.s,NULL)) FROM t;
md5
----------------------------------
9985acddf7902ea603897cdaccd02114
(1 row)
8<-------------
So both UPPER and INITCAP produce different results unless I am missing
something.
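To narrow down which characters are behind the changed checksums, here is a minimal sketch. It assumes the differences come from changed simple case mappings, and compares fields 12 to 14 of two UnicodeData.txt files (simple uppercase, lowercase, and titlecase mappings). The function names are made up for illustration and are not part of PostgreSQL.

```python
def parse_case_mappings(unicode_data_text):
    """Return {code_point: (upper, lower, title)} from UnicodeData.txt content."""
    mappings = {}
    for line in unicode_data_text.splitlines():
        fields = line.split(";")
        if len(fields) < 15:
            continue  # skip blank or malformed lines
        code = int(fields[0], 16)
        # Fields 12, 13, 14 hold the simple uppercase, lowercase, and
        # titlecase mappings; an empty field means "maps to itself".
        mappings[code] = tuple(int(f, 16) if f else code for f in fields[12:15])
    return mappings


def changed_mappings(old_text, new_text):
    """Code points present in both files whose case mappings differ."""
    old = parse_case_mappings(old_text)
    new = parse_case_mappings(new_text)
    return {
        cp: (old[cp], new[cp])
        for cp in old
        if cp in new and old[cp] != new[cp]
    }
```

Feeding it the contents of the UnicodeData.txt files of the two Unicode versions in question should list the already-assigned code points whose upper, lower, or title mapping changed. Newly assigned code points are deliberately left out, since they only matter for data that used unassigned code points.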
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:
On 11/11/24 01:27, Peter Eisentraut wrote:
Here is the patch to update the Unicode data to version 16.0.0.
Normally, this would have been routine, but a few months ago there was
some debate about how this should be handled. [0] AFAICT, the consensus
was to go ahead with it, but I just wanted to notify it here to be clear.
[0]: /messages/by-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com
I ran a check and found that this patch causes changes in upper casing
of some characters.
I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much against it.
From my experiences in the field, I consider this need to rebuild indexes
one of the greatest current problems for the usability of PostgreSQL.
I dare say that most people would prefer living with an outdated Unicode version.
Yours,
Laurenz Albe
On 12.11.24 10:40, Laurenz Albe wrote:
On Mon, 2024-11-11 at 14:52 -0500, Joe Conway wrote:
On 11/11/24 01:27, Peter Eisentraut wrote:
Here is the patch to update the Unicode data to version 16.0.0.
Normally, this would have been routine, but a few months ago there was
some debate about how this should be handled. [0] AFAICT, the consensus
was to go ahead with it, but I just wanted to notify it here to be clear.
[0]: /messages/by-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com
I ran a check and found that this patch causes changes in upper casing
of some characters.
I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much against it.
The practice of regularly updating the Unicode files is older than the
builtin collation provider. It is similar to updating the time zone
files, the encoding conversion files, the snowball files, etc. We need
to move all of these things forward to keep up with the aspects of the
real world that this data reflects. New features are required to live
in that environment. If a new feature were proposed that would then
require us to stop updating any of these files, we would likely not
accept that, or at least need a very deliberate discussion about that
before the feature is introduced. This was not done here at all. If
this new feature has this hidden requirement, then that feature is not
complete yet, and work should probably continue to make that feature
complete. But that can't take progress in other areas hostage.
On Mon, Nov 11, 2024 at 07:27:53AM +0100, Peter Eisentraut wrote:
Here is the patch to update the Unicode data to version 16.0.0.
Normally, this would have been routine, but a few months ago there was some
debate about how this should be handled. [0] AFAICT, the consensus was to
go ahead with it, but I just wanted to notify it here to be clear.
[0]: /messages/by-id/d75d2d0d1d2bd45b2c332c47e3e0a67f0640b49c.camel@j-davis.com
Worth noting that unaccent.rules is unchanged after switching to
16.0.0:
cd contrib/unaccent && make update-unicode
--
Michael
On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:
I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much
against it.
How would you feel if there was a better way to "lock down" the
behavior using an extension?
I have a patchset here:
/messages/by-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel@j-davis.com
that changes the implementation of collation and ctype to use method
tables rather than branching, and it also introduces some hooks that
can be used to replace the method tables with whatever you want.
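As a rough illustration of the difference between branching and a replaceable method table, here is a minimal sketch in Python. All names (CtypeMethods, register_methods, methods_for) are hypothetical and do not correspond to the actual patchset, which is C code inside PostgreSQL. The point is only that callers look up a table of operations once, and an extension can substitute its own table.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CtypeMethods:
    """Hypothetical method table: one entry per ctype operation."""
    lower: Callable[[str], str]
    upper: Callable[[str], str]


# Default table, standing in for the built-in implementation.
DEFAULT_METHODS = CtypeMethods(lower=str.lower, upper=str.upper)

# Hypothetical registry playing the role of the hook: an extension can
# register a replacement table for a given locale name.
_overrides: Dict[str, CtypeMethods] = {}


def register_methods(locale_name: str, methods: CtypeMethods) -> None:
    """Install a replacement method table for one locale."""
    _overrides[locale_name] = methods


def methods_for(locale_name: str) -> CtypeMethods:
    """Return the table for a locale, with no per-provider branching."""
    return _overrides.get(locale_name, DEFAULT_METHODS)
```

An extension that wants to pin behavior would build a table from its own frozen copy of the Unicode data and register it; callers keep using methods_for(...).upper(...) unchanged.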
Regards,
Jeff Davis
On Tue, 2024-11-19 at 13:42 -0800, Jeff Davis wrote:
On Tue, 2024-11-12 at 10:40 +0100, Laurenz Albe wrote:
I want to reiterate what I said in the above thread:
If that means that indexes on strings using the "builtin" collation
provider need to be reindexed after an upgrade, I am very much
against it.
How would you feel if there was a better way to "lock down" the
behavior using an extension?
Better.
I have a patchset here:
/messages/by-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel@j-davis.com
that changes the implementation of collation and ctype to use method
tables rather than branching, and it also introduces some hooks that
can be used to replace the method tables with whatever you want.
That looks like a nice idea, since it obviates the need to build
PostgreSQL yourself if you want to use a non-standard copy of - say -
the ICU library. You still have to build your own ICU library, though.
I had hoped that the builtin provider would remove the need to REINDEX,
but I have given up that hope. Peter's argument is sound from a
conceptual point of view, even though I doubt that the average user
will be able to appreciate it.
Yours,
Laurenz Albe
On Wed, 2024-11-20 at 06:41 +0100, Laurenz Albe wrote:
That looks like a nice idea, since it obviates the need to build
PostgreSQL yourself if you want to use a non-standard copy of - say -
the ICU library. You still have to build your own ICU library,
though.
It would work with the builtin provider, too, which would not require
ICU at all.
The idea is that you could build an extension that copies the same
logic for building the Unicode tables that we have in Postgres now,
except that it uses whatever version of the Unicode data files you
want.
If we want it to be targeted more specifically at the builtin provider,
we can make it even simpler by allowing you to just replace the unicode
tables with an extension (rather than the method tables). I'm not 100%
sure what people actually want here, so I'm open to suggestion.
I had hoped that the builtin provider would remove the need to
REINDEX,
but I have given up that hope. Peter's argument is sound from a
conceptual point of view, even though I doubt that the average user
will be able to appreciate it.
I'd like to provide options for all kinds of users and packagers.
Regards,
Jeff Davis
On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:
The practice of regularly updating the Unicode files is older than
the
builtin collation provider. It is similar to updating the time zone
files, the encoding conversion files, the snowball files, etc. We
need
to move all of these things forward to keep up with the aspects of
the
real world that this data reflects.
Should we consider bundling multiple versions of the generated tables
(header files) along with Postgres?
That would enable a compile-time option to build with an older version
of Unicode if you want, solving the packager concern that Noah raised.
It would also make it easier for people to coordinate the Postgres
version of Unicode and the ICU version of Unicode.
Regards,
Jeff Davis
On Mon, 20 Jan 2025 13:39:35 -0800
Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:
The practice of regularly updating the Unicode files is older than
the
builtin collation provider. It is similar to updating the time
zone files, the encoding conversion files, the snowball files, etc.
We need
to move all of these things forward to keep up with the aspects of
the
real world that this data reflects.
Should we consider bundling multiple versions of the generated tables
(header files) along with Postgres?
That would enable a compile-time option to build with an older version
of Unicode if you want, solving the packager concern that Noah raised.
It would also make it easier for people to coordinate the Postgres
version of Unicode and the ICU version of Unicode.
FWIW, after adding ICU support I personally don't think there's a
pressing need to continue updating the tables anymore. I think ICU is
the best solution for people who need the latest linguistic collation
rules.
On the user side, my main concerns are the same as they've always
been: 100% confidence that Postgres updates will not corrupt any data
or cause incorrect query results, and not being forced to rebuild
everything (or logically copy data to avoid pg_upgrade). I'm at a large
company with many internal devs using Postgres in ways I don't know
about, and many users storing lots of unicode data I don't know about.
I'm working a fair bit with Docker and Kubernetes and CloudNativePG
now, so our builds come through the debian PGDG repo. Bundling multiple
tables doesn't bother me, as long as it's not a precursor to removing
current tables from the debian PGDG builds we consume in the future.
Ironically it's not really an issue yet for us on docker because
support for pg_upgrade is pretty limited at the moment. :) But I
think pg_upgrade support will rapidly improve in docker, and will
become common on large databases.
If Postgres does go the path of multiple tables, does the community
want to accumulate a new set of tables every year? That could add up
quickly. Maybe we don't add new tables every year, but follow the
examples of Oracle and DB2 in accumulating them on a less frequent
basis?
-Jeremy
On Mon, 2025-01-20 at 17:06 -0800, Jeremy Schneider wrote:
FWIW, after adding ICU support I personally don't think there's a
pressing need to continue updating the tables anymore.
I agree that it's not a pressing concern.
If Postgres does go the path of multiple tables, does the community
want to accumulate a new set of tables every year? That could add up
quickly. Maybe we don't add new tables every year, but follow the
examples of Oracle and DB2 in accumulating them on a less frequent
basis?
Yeah, it would probably be every-other-release or something. By the
time we built up enough versions for someone to worry about, hopefully
we'd have some better systems in place to track versions and migrate
forward.
Regards,
Jeff Davis
On Mon, 2025-01-20 at 17:06 -0800, Jeremy Schneider wrote:
On the user side, my main concerns are the same as they've always
been: 100% confidence that Postgres updates will not corrupt any data
or cause incorrect query results
I'll add that, while 100% may be a good goal, it hasn't been the
standard in the past. You're talking about a new standard of
immutability starting in 18, and as Peter pointed out, I don't think
Unicode updates are the only thing we need to consider.
My personal opinion is that both positions -- to upgrade Unicode or not
-- are a bit exaggerated. On the one hand, there's no urgency to
updating Unicode; but on the other hand, there's not a huge danger, at
least compared with our historical standards.
Regards,
Jeff Davis
On 21.01.25 02:06, Jeremy Schneider wrote:
FWIW, after adding ICU support I personally don't think there's a
pressing need to continue updating the tables anymore.
That appears to ignore what these tables are actually used for. They
are used for Unicode normalization, which is used by SCRAM. So in a
slightly hyperbolic sense, keeping these tables updated is
security-relevant. They are also used by psql to determine character
width and format output correctly.
Building a collation provider on this came much later. It was possibly
a mistake how that was done.
On 20.01.25 22:39, Jeff Davis wrote:
On Fri, 2024-11-15 at 17:09 +0100, Peter Eisentraut wrote:
The practice of regularly updating the Unicode files is older than
the
builtin collation provider. It is similar to updating the time zone
files, the encoding conversion files, the snowball files, etc. We
need
to move all of these things forward to keep up with the aspects of
the
real world that this data reflects.
Should we consider bundling multiple versions of the generated tables
(header files) along with Postgres?
I wouldn't have a problem with that.
That would enable a compile-time option to build with an older version
of Unicode if you want, solving the packager concern that Noah raised.
It would also make it easier for people to coordinate the Postgres
version of Unicode and the ICU version of Unicode.
But I don't think it would be a compile-time decision. I think it would
be a run-time selection, similar to the theorized multiple-ICU-versions
feature. (Those two features might even go together, since a given ICU
version also sort of assumes a given Unicode version.)
On Wed, 2025-01-22 at 19:08 +0100, Peter Eisentraut wrote:
But I don't think it would be a compile-time decision. I think it
would
be a run-time selection, similar to the theorized multiple-ICU-
versions
feature. (Those two features might even go together, since a given
ICU
version also sort of assumes a given Unicode version.)
I am trying to get there, and the ctype methods patch is a step in that
direction, but I don't think we will have the full
multi-library-versions work in v18.
A compile-time option does have a chance for v18, and if that satisfies
the immediate concerns of packagers, then we can still update Unicode
in the default build.
Regards,
Jeff Davis
On Wed, 2025-01-22 at 19:03 +0100, Peter Eisentraut wrote:
Building a collation provider on this came much later. It was
possibly
a mistake how that was done.
It wasn't a mistake. "Stability within a PG major version" was called a
*benefit* near the top of the first email on the subject [1]. It was
considered a benefit because it offered a level of stability that
neither libc nor ICU could offer. As far as I know, it's still
considered to be a benefit today by more people than not (e.g. [2]).
The concerns about Unicode updates come from a misunderstanding of the
level of stability offered in the past:
* IMMUTABLE was initially a planner concept [3], which is why it didn't
care much about dependence on GUCs for instance.
* Expression / predicate indexes rely on immutability to mean something
more strict, and for that, dependence on GUCs creates a problem [4].
(Also, partitioning.)
* It's hard to make an immutable UDF without a SET search_path clause,
but until version 17, that was such a huge performance hit that it was
not usable in an expression index. There will be a lot of not-truly-
immutable UDFs used in expression indexes for a long time.
* Ordinary text indexes rely on the collation libraries to be stable,
which is hard to control because they could be updated by the OS. It's
barely possible recently to freeze the version of libc [5]
(https://github.com/awslabs/compat-collation-for-glibc) without
freezing the whole OS version. And if you do manage to freeze both libc
and ICU, you are risking missed security fixes.
* pg_upgrade implicitly relies on IMMUTABLE to mean something even more
strict: stability across major versions. That's a problem for
expression indexes on functions like NORMALIZE(). And, if using the
optional built-in provider, also a problem for expression indexes on
LOWER(), etc.
At each moment we took steps that made sense at the time and in context
and I am not criticizing any of those steps. The biggest practical
problem was unforeseen dramatic changes in glibc that broke a lot of
text indexes. The rest of the problems are a mix of design issues,
feature interactions, and implementation details that were not resolved
before the builtin provider existed and are still not resolved today.
I do not accept the premise that there is a problem with the built-in
provider. I didn't throw caution to the wind and neither did the
reviewers: you, Daniel, Jeremy, and I did a ton of work to understand,
mitigate, and document the risks (along with a lot of help from
Thomas's earlier work). Users who opt in to the built-in provider opt
in to occasional controlled changes according to the rather strict
Unicode stability policies [6]
(https://www.unicode.org/policies/stability_policy.html). These policies
mitigate risks dramatically, especially for those using only assigned
code points, which can be checked with the SQL function
unicode_assigned().
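For data that lives outside the database, a comparable "only assigned code points" check can be sketched in Python. This is an approximation, not the server's implementation: it relies on Python's unicodedata module, whose Unicode version (unicodedata.unidata_version) may differ from the one PostgreSQL was built with, and the function name all_assigned is made up for this example.

```python
import unicodedata


def all_assigned(text):
    """True if no character in text has general category Cn (unassigned)."""
    return all(unicodedata.category(ch) != "Cn" for ch in text)
```

A string that passes this check under the older Unicode version contains no code points that a later version could newly assign, so its casing and normalization results fall under the stability policies mentioned above.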
Regards,
Jeff Davis
[1]: /messages/by-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
[2]: /messages/by-id/3729436.1721322211@sss.pgh.pa.us
[3]: /messages/by-id/3428810.1721160969@sss.pgh.pa.us
[4]:
CREATE TABLE t(f float4);
CREATE UNIQUE INDEX t_idx ON t((f::text));
SET extra_float_digits = 0;
INSERT INTO t VALUES (1.23456789);
INSERT INTO t VALUES (1.23456789); -- error
SET extra_float_digits = 1;
INSERT INTO t VALUES (1.23456789); -- success
On Mon, 2024-11-11 at 07:27 +0100, Peter Eisentraut wrote:
Here is the patch to update the Unicode data to version 16.0.0.
Normally, this would have been routine, but a few months ago there
was
some debate about how this should be handled. [0] AFAICT, the
consensus
was to go ahead with it, but I just wanted to notify it here to be
clear.
We discussed $SUBJECT at the Developer Meeting before FOSDEM.
Those people who were most concerned about the Unicode updates on the
list were not present, so I don't consider the discussion to be
binding. But the attendees present agreed that:
(a) we should not block the update to Unicode indefinitely; and
(b) we should make reasonable attempts to mitigate potential
problems.
One idea for (b) resurfaced, which was to make a best-effort check at
pg_upgrade time for affected indexes. The check would not be
bulletproof, because we can't catch dependencies that are hidden inside
SPI (e.g. a plpgsql function that calls LOWER()), but it would catch
most potential problems.
Patch attached. A few notes:
* The dependency entries don't exist because LOWER(), etc., are
system objects (pinned); so it queries the indexprs, indpreds,
partexprs, and conbin.
* The query is large and perhaps too clever, but it seems to work. I
tried to add inline comments to the SQL, and pgindent had its own ideas
about how to format them -- suggestions welcome.
* We haven't actually done the Unicode update yet, so it will notice
that the PG17 and PG18 Unicode versions are the same, and return early.
Either apply on top of the Unicode update patch, or comment out the
early return for testing.
* It emits a warning rather than an error, so you need to specify
pg_upgrade with "-r" to see the output file.
* I didn't adapt the query to run on pre-17 versions, even though it
could find some potential problem cases (like an index on NORMALIZE()).
I can add that if someone thinks it's worthwhile.
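To make the "best-effort" nature of such a check concrete, here is a minimal Python sketch of the idea. It is not the query from the patch: it assumes the expression text of each index, predicate, partition key, or constraint has already been extracted (for example as deparsed text), and the list of Unicode-dependent function names is an illustrative assumption.

```python
import re

# Assumed, illustrative list of functions whose results depend on Unicode data.
UNICODE_DEPENDENT = ("lower", "upper", "initcap", "normalize")

# Match a direct call such as lower( or NORMALIZE (, but not a longer
# identifier that merely ends in one of the names.
_CALL = re.compile(
    r"\b(" + "|".join(UNICODE_DEPENDENT) + r")\s*\(", re.IGNORECASE
)


def flag_expressions(expressions):
    """Return (object_name, [function names]) for expressions with direct calls.

    expressions is an iterable of (object_name, expression_text) pairs.
    """
    flagged = []
    for name, expr in expressions:
        found = sorted({m.group(1).lower() for m in _CALL.finditer(expr)})
        if found:
            flagged.append((name, found))
    return flagged
```

An expression like lower(name) is flagged, while a call to a user-defined wrapper such as my_plpgsql_lower(name) is not, which mirrors the limitation described above: dependencies hidden inside a function body are invisible to this kind of scan.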
Regards,
Jeff Davis
Attachments:
v1-0001-Add-pg_upgrade-check-for-Unicode-dependent-relati.patch (text/x-patch; charset=UTF-8; +179 -1)
On 05.02.25 22:47, Jeff Davis wrote:
(b) we should make reasonable attempts to mitigate potential
problems.
One idea for (b) resurfaced, which was to make a best-effort check at
pg_upgrade time for affected indexes. The check would not be
bulletproof, because we can't catch dependencies that are hidden inside
SPI (e.g. a plpgsql function that calls LOWER()), but it would catch
most potential problems.
Patch attached. A few notes:
* The dependency entries don't exist because LOWER(), etc., are
system objects (pinned); so it queries the indexprs, indpreds,
partexprs, and conbin.
* The query is large and perhaps too clever, but it seems to work. I
tried to add inline comments to the SQL, and pgindent had its own ideas
about how to format them -- suggestions welcome.
* We haven't actually done the Unicode update yet, so it will notice
that the PG17 and PG18 Unicode versions are the same, and return early.
Either apply on top of the Unicode update patch, or comment out the
early return for testing.
* It emits a warning rather than an error, so you need to specify
pg_upgrade with "-r" to see the output file.
* I didn't adapt the query to run on pre-17 versions, even though it
could find some potential problem cases (like an index on NORMALIZE()).
I can add that if someone thinks it's worthwhile.
This all looks quite reasonable to me. The code could obviously use a
bit closer inspection, but the way you've structured it looked quite
okay to me.
On Mon, 2024-11-18 at 13:58 +0900, Michael Paquier wrote:
Worth noting that unaccent.rules is unchanged after switching to
16.0.0:
cd contrib/unaccent && make update-unicode
What diffs are you seeing? I don't see any diffs to unaccent.rules
since Unicode 14.0.0.
Aside: it looks like that target is not being run with the meson build.
Regards,
Jeff Davis
On Fri, 2025-02-14 at 09:02 +0100, Peter Eisentraut wrote:
This all looks quite reasonable to me. The code could obviously use
a
bit closer inspection, but the way you've structured it looked quite
okay to me.
Attached a version that rebases both patches. In my patch, I added a
report_status().
Note that the Unicode update has a few test diffs for NORMALIZE(),
please check to see if the tests themselves need an update.
Regards,
Jeff Davis