Unicode update and some tooling improvements

Started by Peter Eisentraut, 20 days ago, 7 messages
#1 Peter Eisentraut
peter_e@gmx.net

This is the annual update of the Unicode data. I also worked a bit on
the tooling. The update-unicode target under meson did not update the
data in contrib/unaccent/, so I added that. I also fixed a Python
deprecation warning in the generation script and made some light changes
in the surrounding documentation.

Attachments:

0001-Fix-Python-deprecation-warning.patch (text/plain; charset=UTF-8), +1 -2
0002-doc-Fix-capitalization-of-Unicode.patch (text/plain; charset=UTF-8), +1 -2
0003-Implement-unaccent-Unicode-data-update-in-meson.patch (text/plain; charset=UTF-8), +63 -19
0004-Update-RELEASE_CHANGES.patch (text/plain; charset=UTF-8), +1 -3
0005-Update-Unicode-data-to-CLDR-48.1.patch (text/plain; charset=UTF-8), +2 -3
0006-Update-Unicode-data-to-Unicode-17.0.0.patch (text/plain; charset=UTF-8), +4034 -3675
#2 Chao Li
li.evan.chao@gmail.com
In reply to: Peter Eisentraut (#1)
Re: Unicode update and some tooling improvements

On Feb 27, 2026, at 04:36, Peter Eisentraut <peter@eisentraut.org> wrote:

This is the annual update of the Unicode data. I also worked a bit on the tooling. The update-unicode target under meson did not update the data in contrib/unaccent/, so I added that. I also fixed a Python deprecation warning in the generation script and made some light changes in the surrounding documentation.

Overall looks good to me.

To verify this patch, I upgraded my local ICU to version 78.2, then ran the Python script:
```
chaol@ChaodeMacBook-Air postgresql % python3 contrib/unaccent/generate_unaccent_rules.py \
--unicode-data-file src/common/unicode/UnicodeData.txt \
--latin-ascii-file contrib/unaccent/Latin-ASCII.xml \
> /tmp/unaccent.rules.new
chaol@ChaodeMacBook-Air postgresql % diff -u contrib/unaccent/unaccent.rules /tmp/unaccent.rules.new # no difference
```
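
The same comparison can be scripted when it needs to be repeated. This is only a minimal sketch, not part of the patch set: the helper name `rules_diff` and the sample strings are hypothetical, and it assumes the regenerated rules have already been read into memory.

```python
import difflib
from typing import List

def rules_diff(old_text: str, new_text: str) -> List[str]:
    """Return unified-diff lines for two rules files; empty list if identical."""
    return list(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="unaccent.rules",
        tofile="unaccent.rules.new",
    ))

# Hypothetical sample content, for illustration only.
unchanged = rules_diff("a\tA\n", "a\tA\n")
changed = rules_diff("a\tA\n", "a\tA\nb\tB\n")

print("identical" if not unchanged else "differs")
print("identical" if not changed else "differs")
```

An empty result corresponds to the "no difference" outcome of `diff -u` above.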

I also ran a clean meson build and specifically verified the new Unicode wiring:
```
chaol@ChaodeMacBook-Air postgresql % ninja -C build update-unicode # passed
```

Then I ran the tests:
```
chaol@ChaodeMacBook-Air postgresql % ninja -C build -t targets | grep update-unicode
update-unicode: phony
chaol@ChaodeMacBook-Air postgresql % ninja -C build test # passed
ninja: Entering directory `build'
[406/407] Running all tests

Ok: 333
Fail: 0
Skipped: 30

Full log written to /Users/chaol/Documents/code/postgresql/build/meson-logs/testlog.txt
```

Only a small comment on 0003:
```
# Meson 0.57.0 and 0.57.1 are buggy, therefore >=0.57.2.
- meson_version: '>=0.57.2',
+ # FIXME: update comment
+ meson_version: '>=0.58',
```

Why leave a FIXME instead of just updating the comment? I saw that the installation.sgml doc has been updated.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#3 Peter Eisentraut
peter_e@gmx.net
In reply to: Chao Li (#2)
Re: Unicode update and some tooling improvements

On 27.02.26 03:50, Chao Li wrote:

Only a small comment on 0003:
```
# Meson 0.57.0 and 0.57.1 are buggy, therefore >=0.57.2.
- meson_version: '>=0.57.2',
+ # FIXME: update comment
+ meson_version: '>=0.58',
```

Why leave a FIXME instead of just updating the comment? I saw that the installation.sgml doc has been updated.

It wasn't meant to be committed that way. I just didn't want to spend
the time crafting a comment before it was generally agreed to proceed in
this way that required a meson version update.

#4 Alexander Borisov
lex.borisov@gmail.com
In reply to: Peter Eisentraut (#1)
Re: Unicode update and some tooling improvements

26.02.2026 23:36, Peter Eisentraut wrote:

This is the annual update of the Unicode data.  I also worked a bit on
the tooling.  The update-unicode target under meson did not update the
data in contrib/unaccent/, so I added that.  I also fixed a Python
deprecation warning in the generation script and made some light changes
in the surrounding documentation.

Installed, tested, checked it out.
I hope I'm not late.

"[PATCH 3/6] Implement unaccent Unicode data update in meson"

The idea of raising the minimum Meson version is good.
But it seems like we can do without raising the version.
As I understand it, the minimum version is being raised because of
.replace(), but it can be successfully replaced here with the following
construct:
cldr_version_dashed = '-'.join(CLDR_VERSION.split('.'))
url = cldr_baseurl.format(cldr_version_dashed, f)
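
To illustrate why the split/join spelling is a drop-in substitute for `.replace()`, here is a small sketch in Python rather than Meson. It only mirrors the string operations; `meson_style_format` is a hypothetical stand-in for Meson's positional `@0@`/`@1@` formatting, and the version value is the one named in patch 0005.

```python
CLDR_VERSION = "48.1"  # value taken from patch 0005 in this thread

# Two spellings of "turn dots into dashes".
dashed_via_replace = CLDR_VERSION.replace(".", "-")
dashed_via_split_join = "-".join(CLDR_VERSION.split("."))

# Both produce the same string, so the newer method is not strictly needed.
assert dashed_via_replace == dashed_via_split_join

def meson_style_format(template: str, *args: str) -> str:
    """Hypothetical helper: fill @0@, @1@, ... placeholders positionally."""
    for i, arg in enumerate(args):
        template = template.replace(f"@{i}@", arg)
    return template

cldr_baseurl = (
    "https://raw.githubusercontent.com/unicode-org/cldr/"
    "release-@0@/common/transforms/@1@"
)
url = meson_style_format(cldr_baseurl, dashed_via_split_join, "Latin-ASCII.xml")
print(url)
```

The sketch does not check that the resulting URL exists; it only shows that both spellings yield the same path component.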

I would increase the minimum version of Meson, but I would do it with a
separate patch so that the commit log would be "loud":
- Increase the minimum version for Meson.

This would be useful for users who look at commit logs.
Currently, the minimum version for Meson is increased "secretly" inside
the patch. Or at least explicitly indicate this in the commit log for
this patch.

Otherwise, looks good to me.
I am in favor of regular Unicode updates. 🙂

--
Best regards,
Alexander Borisov

#5 Peter Eisentraut
peter_e@gmx.net
In reply to: Alexander Borisov (#4)
Re: Unicode update and some tooling improvements

On 13.03.26 11:11, Alexander Borisov wrote:

26.02.2026 23:36, Peter Eisentraut wrote:

This is the annual update of the Unicode data.  I also worked a bit on
the tooling.  The update-unicode target under meson did not update the
data in contrib/unaccent/, so I added that.  I also fixed a Python
deprecation warning in the generation script and made some light
changes in the surrounding documentation.

Installed, tested, checked it out.
I hope I'm not late.

"[PATCH 3/6] Implement unaccent Unicode data update in meson"

The idea of raising the minimum Meson version is good.
But it seems like we can do without raising the version.
As I understand it, the minimum version is being raised because of
.replace(), but it can be successfully replaced here with the following
construct:
cldr_version_dashed = '-'.join(CLDR_VERSION.split('.'))
url = cldr_baseurl.format(cldr_version_dashed, f)

Good idea. I committed it that way, without a meson version change for
the moment.

#6 Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#1)
Re: Unicode update and some tooling improvements

Hi,

On 2026-02-26 21:36:08 +0100, Peter Eisentraut wrote:

This is the annual update of the Unicode data. I also worked a bit on the
tooling. The update-unicode target under meson did not update the data in
contrib/unaccent/, so I added that. I also fixed a Python deprecation
warning in the generation script and made some light changes in the
surrounding documentation.

From ef15b16dcef7a3868fc37744d201bb233f8271bd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 11:36:27 +0100
Subject: [PATCH 3/6] Implement unaccent Unicode data update in meson

The meson/ninja update-unicode target did not cover the required
updates in contrib/unaccent/. This is fixed now.

Makes sense.

+# Download CLDR files on demand.
+
+cldr_baseurl = 'https://raw.githubusercontent.com/unicode-org/cldr/release-@0@/common/transforms/@1@'

Hm. I take it the relevant contents aren't available on unicode.org, which we
use in src/common/unicode?

We reference githubusercontent.com in Makefile too, but somehow that feels a
bit weird.

+if not wget.found() or not cp.found()
+  subdir_done()
+endif
+
+foreach f : ['Latin-ASCII.xml']
+  # XXX .replace requires meson 0.58
+  url = cldr_baseurl.format(CLDR_VERSION.replace('.', '-'), f)

I think this could be replaced with something like
CLDR_VERSION.split('.').join('-')
for < 0.58 compat. But I'm also ok with going to 0.58.

From 20d5a665f72b3ddde8bfdf06b94d218da0dc2d09 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 11:38:16 +0100
Subject: [PATCH 4/6] Update RELEASE_CHANGES

The existing instructions did not cover meson. Point to
src/common/unicode/README instead, where there is more information.

LGTM.

From 868e269b518daf0d3d288e2e379d5fd3ad215f49 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 10:25:48 +0100
Subject: [PATCH 5/6] Update Unicode data to CLDR 48.1

No actual changes result.

XXX should change that to CLDR 49 in April

48.2 has been released from what I can tell.

LGTM otherwise.

From dd4b5ced419b319c24fa0928180e54d7317e1690 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 11:38:51 +0100
Subject: [PATCH 6/6] Update Unicode data to Unicode 17.0.0

Looks like 18 is out, any reason to not go straight to that?

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 7d65e428607..b99116a9ef8 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -376,7 +376,7 @@ DOWNLOAD = wget -O $@ --no-use-server-timestamps
# Pick a release from here: <https://www.unicode.org/Public/>.  Note
# that the most recent release listed there is often a pre-release;
# don't pick that one, except for testing.
-UNICODE_VERSION = 16.0.0
+UNICODE_VERSION = 17.0.0

Wonder if we, in a separate change, should put UNICODE_VERSION and
CLDR_VERSION in dedicated files (probably just named
UNICODE_VERSION/CLDR_VERSION) that could then be shared by meson & make.
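
As a rough sketch of what consuming such a file could look like from a helper script: the one-line file layout, the `read_version` helper, and the validation pattern below are all assumptions made for illustration, not anything proposed in the patches.

```python
import re
import tempfile
from pathlib import Path

def read_version(path: Path) -> str:
    """Read a single dotted version string (e.g. '17.0.0') from a one-line file."""
    text = path.read_text(encoding="utf-8").strip()
    if not re.fullmatch(r"\d+(\.\d+)*", text):
        raise ValueError(f"unexpected version string in {path}: {text!r}")
    return text

# Demo: a temporary directory stands in for the source tree (assumption).
with tempfile.TemporaryDirectory() as tmp:
    version_file = Path(tmp) / "UNICODE_VERSION"
    version_file.write_text("17.0.0\n", encoding="utf-8")
    unicode_version = read_version(version_file)

print(unicode_version)
```

Keeping the file to a single bare version string would let each build system read it with its own plain file-reading facility, without any parsing logic.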

Greetings,

Andres Freund

#7 Alexander Borisov
lex.borisov@gmail.com
In reply to: Andres Freund (#6)
Re: Unicode update and some tooling improvements

18.03.2026 17:20, Andres Freund wrote:

Hi,

On 2026-02-26 21:36:08 +0100, Peter Eisentraut wrote:

This is the annual update of the Unicode data. I also worked a bit on the
tooling. The update-unicode target under meson did not update the data in
contrib/unaccent/, so I added that. I also fixed a Python deprecation
warning in the generation script and made some light changes in the
surrounding documentation.

From ef15b16dcef7a3868fc37744d201bb233f8271bd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 11:36:27 +0100
Subject: [PATCH 3/6] Implement unaccent Unicode data update in meson

The meson/ninja update-unicode target did not cover the required
updates in contrib/unaccent/. This is fixed now.

Makes sense.

[..]

From dd4b5ced419b319c24fa0928180e54d7317e1690 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Feb 2026 11:38:51 +0100
Subject: [PATCH 6/6] Update Unicode data to Unicode 17.0.0

Looks like 18 is out, any reason to not go straight to that?

18 is currently in alpha, so it may be better to wait until the stable
release in September this year.

https://www.unicode.org/releases/

[..]

-UNICODE_VERSION = 16.0.0
+UNICODE_VERSION = 17.0.0

Wonder if we, in a separate change, should put UNICODE_VERSION and
CLDR_VERSION in dedicated files (probably just named
UNICODE_VERSION/CLDR_VERSION) that could then be shared by meson & make.

Greetings,

Andres Freund

--
Regards,
Alexander Borisov