Unicode update and some tooling improvements
This is the annual update of the Unicode data. I also worked a bit on
the tooling. The update-unicode target under meson did not update the
data in contrib/unaccent/, so I added that. I also fixed a Python
deprecation warning in the generation script and made some light changes
in the surrounding documentation.
Attachments:
0001-Fix-Python-deprecation-warning.patch (+1 -2)
0002-doc-Fix-capitalization-of-Unicode.patch (+1 -2)
0003-Implement-unaccent-Unicode-data-update-in-meson.patch (+63 -19)
0004-Update-RELEASE_CHANGES.patch (+1 -3)
0005-Update-Unicode-data-to-CLDR-48.1.patch (+2 -3)
0006-Update-Unicode-data-to-Unicode-17.0.0.patch (+4034 -3675)
On Feb 27, 2026, at 04:36, Peter Eisentraut <peter@eisentraut.org> wrote:
This is the annual update of the Unicode data. I also worked a bit on the tooling. The update-unicode target under meson did not update the data in contrib/unaccent/, so I added that. I also fixed a Python deprecation warning in the generation script and made some light changes in the surrounding documentation.
Overall looks good to me.
To verify this patch, I upgraded my local ICU to version 78.2, then ran the Python script:
```
chaol@ChaodeMacBook-Air postgresql % python3 contrib/unaccent/generate_unaccent_rules.py \
--unicode-data-file src/common/unicode/UnicodeData.txt \
--latin-ascii-file contrib/unaccent/Latin-ASCII.xml \
/tmp/unaccent.rules.new
chaol@ChaodeMacBook-Air postgresql %
chaol@ChaodeMacBook-Air postgresql %
chaol@ChaodeMacBook-Air postgresql % diff -u contrib/unaccent/unaccent.rules /tmp/unaccent.rules.new # no difference
```
I also ran a clean meson build and specifically verified the new Unicode wiring:
```
chaol@ChaodeMacBook-Air postgresql % ninja -C build update-unicode # passed
```
And test:
```
chaol@ChaodeMacBook-Air postgresql % ninja -C build -t targets | grep update-unicode
update-unicode: phony
chaol@ChaodeMacBook-Air postgresql % ninja -C build test # passed
ninja: Entering directory `build'
[406/407] Running all tests
…
Ok: 333
Fail: 0
Skipped: 30
Full log written to /Users/chaol/Documents/code/postgresql/build/meson-logs/testlog.txt
```
Only a small comment on 0003:
```
# Meson 0.57.0 and 0.57.1 are buggy, therefore >=0.57.2.
- meson_version: '>=0.57.2',
+ # FIXME: update comment
+ meson_version: '>=0.58',
```
Why leave a FIXME instead of just updating the comment? I saw the installation.sgml doc has been updated.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On 27.02.26 03:50, Chao Li wrote:
Only a small comment on 0003:
```
# Meson 0.57.0 and 0.57.1 are buggy, therefore >=0.57.2.
- meson_version: '>=0.57.2',
+ # FIXME: update comment
+ meson_version: '>=0.58',
```
Why leave a FIXME instead of just updating the comment? I saw the installation.sgml doc has been updated.
It wasn't meant to be committed that way. I just didn't want to spend
the time crafting a comment before there was general agreement to
proceed in a way that requires a meson version update.
26.02.2026 23:36, Peter Eisentraut wrote:
This is the annual update of the Unicode data. I also worked a bit on
the tooling. The update-unicode target under meson did not update the
data in contrib/unaccent/, so I added that. I also fixed a Python
deprecation warning in the generation script and made some light changes
in the surrounding documentation.
Installed, tested, checked it out.
I hope I'm not late.
"[PATCH 3/6] Implement unaccent Unicode data update in meson"
The idea of raising the minimum Meson version is good.
But it seems like we can do without raising the version.
As I understand it, the minimum version is being raised because of
.replace(), but it can be successfully replaced here with the following
construct:
cldr_version_dashed = '-'.join(CLDR_VERSION.split('.'))
url = cldr_baseurl.format(cldr_version_dashed, f)
I would increase the minimum version of Meson, but I would do it with a
separate patch so that the commit log would be "loud":
- Increase the minimum version for Meson.
This would be useful for users who look at commit logs.
Currently, the minimum version for Meson is increased "secretly" inside
the patch. At the very least, the commit log for this patch should
mention it explicitly.
Otherwise, looks good to me.
I am in favor of regular Unicode updates. 🙂
--
Best regards,
Alexander Borisov
On 13.03.26 11:11, Alexander Borisov wrote:
26.02.2026 23:36, Peter Eisentraut wrote:
This is the annual update of the Unicode data. I also worked a bit on
the tooling. The update-unicode target under meson did not update the
data in contrib/unaccent/, so I added that. I also fixed a Python
deprecation warning in the generation script and made some light
changes in the surrounding documentation.
Installed, tested, checked it out.
I hope I'm not late.
"[PATCH 3/6] Implement unaccent Unicode data update in meson"
The idea of raising the minimum Meson version is good.
But it seems like we can do without raising the version.
As I understand it, the minimum version is being raised because of
.replace(), but it can be successfully replaced here with the following
construct:
cldr_version_dashed = '-'.join(CLDR_VERSION.split('.'))
url = cldr_baseurl.format(cldr_version_dashed, f)
Good idea. I committed it that way, without a meson version change for
the moment.
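As a quick sanity check of the split/join construct discussed above, the same string logic can be tried outside of meson. The following is only a sketch in Python, not part of any patch: the helper name `cldr_url` is hypothetical, the URL template is copied from the quoted diff with `{0}`/`{1}` standing in for meson's `@0@`/`@1@`, and Python's `split`/`join`/`replace` are assumed to behave like their meson counterparts for a plain dotted version string.

```python
# Illustrative sketch only: cldr_url() is a hypothetical helper, and the
# template below is the one from the quoted patch with Python placeholders.

CLDR_BASEURL = (
    "https://raw.githubusercontent.com/unicode-org/cldr/"
    "release-{0}/common/transforms/{1}"
)


def cldr_url(cldr_version: str, filename: str) -> str:
    """Build the download URL using split/join instead of replace."""
    # "48.1" -> ["48", "1"] -> "48-1"
    dashed = "-".join(cldr_version.split("."))
    return CLDR_BASEURL.format(dashed, filename)


url = cldr_url("48.1", "Latin-ASCII.xml")
print(url)
```

For a version string that contains only dots as separators, splitting on '.' and joining with '-' gives the same result as replacing '.' with '-', which is why the meson version bump is not strictly needed for this step.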
Hi,
On 2026-02-26 21:36:08 +0100, Peter Eisentraut wrote:
This is the annual update of the Unicode data. I also worked a bit on the
tooling. The update-unicode target under meson did not update the data in
contrib/unaccent/, so I added that. I also fixed a Python deprecation
warning in the generation script and made some light changes in the
surrounding documentation.
From ef15b16dcef7a3868fc37744d201bb233f8271bd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 11:36:27 +0100
Subject: [PATCH 3/6] Implement unaccent Unicode data update in meson
The meson/ninja update-unicode target did not cover the required
updates in contrib/unaccent/. This is fixed now.
Makes sense.
+# Download CLDR files on demand.
+
+cldr_baseurl = 'https://raw.githubusercontent.com/unicode-org/cldr/release-@0@/common/transforms/@1@'
Hm. I take it the relevant contents aren't available on unicode.org, which we
use in src/common/unicode?
We reference githubusercontent.com in Makefile too, but somehow that feels a
bit weird.
+if not wget.found() or not cp.found()
+  subdir_done()
+endif
+
+foreach f : ['Latin-ASCII.xml']
+  # XXX .replace requires meson 0.58
+  url = cldr_baseurl.format(CLDR_VERSION.replace('.', '-'), f)
I think this could be replaced with something like
CLDR_VERSION.split('.').join('-')
for < 0.58 compat. But I'm also ok with going to 0.58.
From 20d5a665f72b3ddde8bfdf06b94d218da0dc2d09 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 11:38:16 +0100
Subject: [PATCH 4/6] Update RELEASE_CHANGES
The existing instructions did not cover meson. Point to
src/common/unicode/README instead, where there is more information.
LGTM.
From 868e269b518daf0d3d288e2e379d5fd3ad215f49 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 10:25:48 +0100
Subject: [PATCH 5/6] Update Unicode data to CLDR 48.1
No actual changes result.
XXX should change that to CLDR 49 in April
48.2 has been released from what I can tell.
LGTM otherwise.
From dd4b5ced419b319c24fa0928180e54d7317e1690 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 11:38:51 +0100
Subject: [PATCH 6/6] Update Unicode data to Unicode 17.0.0
Looks like 18 is out, any reason to not go straight to that?
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 7d65e428607..b99116a9ef8 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -376,7 +376,7 @@ DOWNLOAD = wget -O $@ --no-use-server-timestamps
 # Pick a release from here: <https://www.unicode.org/Public/>. Note
 # that the most recent release listed there is often a pre-release;
 # don't pick that one, except for testing.
-UNICODE_VERSION = 16.0.0
+UNICODE_VERSION = 17.0.0
Wonder if we, in a separate change, should put UNICODE_VERSION and
CLDR_VERSION in dedicated files (probably just named
UNICODE_VERSION/CLDR_VERSION) that could then be shared by meson & make.
Greetings,
Andres Freund
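To make the shared-version-file idea above a little more concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the `read_version` helper is hypothetical, the one-line file layout is a guess, and the demo writes throwaway files to a temporary directory rather than touching a real source tree. How meson and make would actually consume such files is not decided in this thread.

```python
# Hypothetical sketch: the file names follow the suggestion in the thread,
# but the one-line-per-file layout and this helper are assumptions.
import tempfile
from pathlib import Path


def read_version(directory: Path, name: str) -> str:
    """Return the single version string stored in directory/name."""
    text = (directory / name).read_text(encoding="ascii").strip()
    if not text or len(text.splitlines()) != 1:
        raise ValueError(f"{name} must contain exactly one version string")
    return text


# Demonstrate with throwaway files instead of a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "UNICODE_VERSION").write_text("17.0.0\n", encoding="ascii")
    (root / "CLDR_VERSION").write_text("48.1\n", encoding="ascii")
    versions = {
        name: read_version(root, name)
        for name in ("UNICODE_VERSION", "CLDR_VERSION")
    }

print(versions)
```

The point of such a layout would be a single source of truth, so that a version bump touches one small file per data set instead of both build systems.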
18.03.2026 17:20, Andres Freund wrote:
Hi,
On 2026-02-26 21:36:08 +0100, Peter Eisentraut wrote:
This is the annual update of the Unicode data. I also worked a bit on the
tooling. The update-unicode target under meson did not update the data in
contrib/unaccent/, so I added that. I also fixed a Python deprecation
warning in the generation script and made some light changes in the
surrounding documentation.
From ef15b16dcef7a3868fc37744d201bb233f8271bd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 11:36:27 +0100
Subject: [PATCH 3/6] Implement unaccent Unicode data update in meson
The meson/ninja update-unicode target did not cover the required
updates in contrib/unaccent/. This is fixed now.
Makes sense.
[..]
From dd4b5ced419b319c24fa0928180e54d7317e1690 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 11:38:51 +0100
Subject: [PATCH 6/6] Update Unicode data to Unicode 17.0.0
Looks like 18 is out, any reason to not go straight to that?
18 is currently in alpha, so it may be better to wait until the stable
release in September this year.
https://www.unicode.org/releases/
[..]
-UNICODE_VERSION = 16.0.0
+UNICODE_VERSION = 17.0.0
Wonder if we, in a separate change, should put UNICODE_VERSION and
CLDR_VERSION in dedicated files (probably just named
UNICODE_VERSION/CLDR_VERSION) that could then be shared by meson & make.
Greetings,
Andres Freund
--
Regards,
Alexander Borisov