Add support for automatically updating Unicode derived files

Started by Peter Eisentraut, about 6 years ago, 10 messages

peter.eisentraut@2ndquadrant.com

about 6 years ago

1 attachment(s)

Continuing the discussion from [0] and [1], here is a patch that
automates the process of updating Unicode derived files. Summary:

- Edit UNICODE_VERSION and/or CLDR_VERSION in src/Makefile.global.in
- Run make update-unicode
- Commit
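
For illustration only, here is a rough Python sketch of which URLs the two version settings end up selecting. The URL patterns are copied from the Makefile rules in the attached patch; the function names and the file list constant are made up for this sketch and do not exist in the tree.

```python
# Sketch only: mirrors the URL patterns used by the patch's Makefile rules.
# The helper names below are hypothetical; only the URL layouts come from the patch.

UNICODE_VERSION = "12.1.0"  # value the patch sets in src/Makefile.global.in
CLDR_VERSION = "34"         # likewise

# Files fetched by the rule in src/common/unicode/Makefile
UCD_FILES = ["UnicodeData.txt", "CompositionExclusions.txt", "NormalizationTest.txt"]


def ucd_url(filename, unicode_version=UNICODE_VERSION):
    """URL of a Unicode Character Database file for a given Unicode version."""
    return f"https://www.unicode.org/Public/{unicode_version}/ucd/{filename}"


def cldr_latin_ascii_url(cldr_version=CLDR_VERSION):
    """URL of Latin-ASCII.xml; dots in the version become dashes in the tag,
    as done by $(subst .,-,$(CLDR_VERSION)) in contrib/unaccent/Makefile."""
    tag = "release-" + cldr_version.replace(".", "-")
    return (
        "https://raw.githubusercontent.com/unicode-org/cldr/"
        f"{tag}/common/transforms/Latin-ASCII.xml"
    )


if __name__ == "__main__":
    for name in UCD_FILES:
        print(ucd_url(name))
    print(cldr_latin_ascii_url())
```

Because the download rules depend on src/Makefile.global, changing either version there should cause the files to be fetched again on the next make update-unicode.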

I have added that to the release checklist in src/tools/RELEASE_CHANGES.

This also includes the script used in [0] that was not committed at that
time. Other than that, this just refactors existing build code.
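
To give a feel for what that script does without reading the Perl, here is a minimal Python paraphrase of its logic, run on a small hand-trimmed excerpt of UnicodeData.txt (only the first three fields are kept). This is a sketch, not the committed code: the function names are invented, and unlike the Perl version it stops at the first code point above 0xFFFF and flushes a range that is still open at that point.

```python
# Sketch only: a Python paraphrase of generate-unicode_combining_table.pl.
# Function names are hypothetical; the real script is the Perl one in the patch.

# Hand-trimmed excerpt of UnicodeData.txt: code point; name; general category.
SAMPLE = """\
0482;CYRILLIC THOUSANDS SIGN;So
0483;COMBINING CYRILLIC TITLO;Mn
0484;COMBINING CYRILLIC PALATALIZATION;Mn
0488;COMBINING CYRILLIC HUNDRED THOUSANDS SIGN;Me
0489;COMBINING CYRILLIC MILLIONS SIGN;Me
048A;CYRILLIC CAPITAL LETTER SHORT I WITH TAIL;Lu
05BE;HEBREW PUNCTUATION MAQAF;Pd
05BF;HEBREW POINT RAFE;Mn
05C0;HEBREW PUNCTUATION PASEQ;Po
"""


def combining_intervals(lines):
    """Collapse consecutive Mn/Me entries into (first, last) code point pairs."""
    intervals = []
    start = None  # first code point of the range being collected, if any
    prev = None   # code point of the previous input line
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        codepoint = int(fields[0], 16)
        if codepoint > 0xFFFF:
            break  # the table only covers the Basic Multilingual Plane
        if fields[2] in ("Mn", "Me"):
            if start is None:
                start = codepoint
        elif start is not None:
            # not a combining character: close the range at the previous entry
            intervals.append((start, prev))
            start = None
        prev = codepoint
    if start is not None:
        intervals.append((start, prev))
    return intervals


def format_table(intervals):
    """Render the intervals in the same shape as unicode_combining_table.h."""
    rows = "".join(f"\t{{0x{lo:04X}, 0x{hi:04X}}},\n" for lo, hi in intervals)
    return "static const struct mbinterval combining[] = {\n" + rows + "};\n"


if __name__ == "__main__":
    print(format_table(combining_intervals(SAMPLE.splitlines())), end="")
```

Note that code points missing from the input (such as 0485 to 0487 in the excerpt) do not split a range, since only listed entries are examined; the Perl script behaves the same way.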

Open questions that are currently not handled consistently:

- Should the downloaded files be listed in .gitignore?
- Should the downloaded files be cleaned by make clean (or distclean or
maintainer-clean or none)?
- Should the generated files be excluded from pgindent? Currently, the
generated files will not pass pgindent unchanged, so that could cause
annoying whitespace battles when these files are updated and re-indented
around release time.

[0]: /messages/by-id/bbb19114-af1e-513b-08a9-61272794bd5c@2ndquadrant.com
[1]: /messages/by-id/77f69366-ca31-6437-079f-47fce69bae1b@2ndquadrant.com

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Add-support-for-automatically-updating-Unicode-deriv.patch (text/plain; charset=UTF-8)

From 54a6d956cb49185af714e9739b47cab27b7f27f1 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 29 Oct 2019 10:43:27 +0100
Subject: [PATCH] Add support for automatically updating Unicode derived files

We currently have several sets of files generated from data provided
by Unicode.  These all have ad hoc rules and instructions for updating
when new Unicode versions appear, and it's not done consistently.

This patch centralizes and automates the process and makes it part of
the release checklist.  The Unicode and CLDR versions are specified in
Makefile.global.in.  There is a new make target "update-unicode" that
downloads all the relevant files and runs the generation script.

There is also a new script for generating the table of combining
characters for ucs_wcwidth().  That table is now in a separate include
file rather than hardcoded into the middle of other code.  This is
based on the script that was used for generating
d8594d123c155aeecd47fc2450f62f5100b2fbf0, but the script itself wasn't
committed at that time.
---
 GNUmakefile.in                                |   4 +
 contrib/unaccent/Makefile                     |  14 ++
 contrib/unaccent/generate_unaccent_rules.py   |  10 +-
 src/Makefile.global.in                        |  18 +-
 src/backend/utils/mb/Unicode/Makefile         |   3 -
 src/backend/utils/mb/wchar.c                  |  68 +-----
 src/common/unicode/.gitignore                 |   5 -
 src/common/unicode/Makefile                   |  15 +-
 .../generate-unicode_combining_table.pl       |  52 +++++
 src/include/common/unicode_combining_table.h  | 194 ++++++++++++++++++
 src/tools/RELEASE_CHANGES                     |   3 +
 11 files changed, 300 insertions(+), 86 deletions(-)
 create mode 100644 src/common/unicode/generate-unicode_combining_table.pl
 create mode 100644 src/include/common/unicode_combining_table.h

diff --git a/GNUmakefile.in b/GNUmakefile.in
index 9dc373c79c..6174d22b0c 100644
--- a/GNUmakefile.in
+++ b/GNUmakefile.in
@@ -75,6 +75,10 @@ $(call recurse,installcheck-world,src/test src/pl src/interfaces/ecpg contrib sr
 GNUmakefile: GNUmakefile.in $(top_builddir)/config.status
 	./config.status $@
 
+update-unicode:
+	$(MAKE) -C src/common/unicode $@
+	$(MAKE) -C contrib/unaccent $@
+
 
 ##########################################################################
 
diff --git a/contrib/unaccent/Makefile b/contrib/unaccent/Makefile
index f8e3860926..37257a7e35 100644
--- a/contrib/unaccent/Makefile
+++ b/contrib/unaccent/Makefile
@@ -24,3 +24,17 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+update-unicode:
+	$(MAKE) unaccent.rules
+
+unaccent.rules: generate_unaccent_rules.py ../../src/common/unicode/UnicodeData.txt Latin-ASCII.xml
+	$(PYTHON) $< --unicode-data-file $(word 2,$^) --latin-ascii-file $(word 3,$^) >$@
+
+# only download it once
+../../src/common/unicode/UnicodeData.txt:
+	$(MAKE) -C $(@D) $(@F)
+
+# dependency on Makefile.global is for CLDR_VERSION
+Latin-ASCII.xml: $(top_builddir)/src/Makefile.global
+	$(DOWNLOAD) https://raw.githubusercontent.com/unicode-org/cldr/release-$(subst .,-,$(CLDR_VERSION))/common/transforms/Latin-ASCII.xml
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index acfb4f0b68..a952de510c 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -20,13 +20,11 @@
 # option is enabled, the XML file of this transliterator [2] -- given as a
 # command line argument -- will be parsed and used.
 #
-# Ideally you should use the latest release for each data set.  For
-# Latin-ASCII.xml, the latest data sets released can be browsed directly
-# via [3].  Note that this script is compatible with at least release 29.
+# Ideally you should use the latest release for each data set.  This
+# script is compatible with at least CLDR release 29.
 #
-# [1] https://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt
-# [2] https://raw.githubusercontent.com/unicode-org/cldr/release-34/common/transforms/Latin-ASCII.xml
-# [3] https://github.com/unicode-org/cldr/tags
+# [1] https://www.unicode.org/Public/${UNICODE_VERSION}/ucd/UnicodeData.txt
+# [2] https://raw.githubusercontent.com/unicode-org/cldr/${TAG}/common/transforms/Latin-ASCII.xml
 
 # BEGIN: Python 2/3 compatibility - remove when Python 2 compatibility dropped
 # The approach is to be Python3 compatible with Python2 "backports".
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 05b66380e0..1e2f4ee405 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -23,7 +23,7 @@ standard_targets = all install installdirs uninstall distprep clean distclean ma
 # these targets should recurse even into subdirectories not being built:
 standard_always_targets = distprep clean distclean maintainer-clean
 
-.PHONY: $(standard_targets) install-strip html man installcheck-parallel
+.PHONY: $(standard_targets) install-strip html man installcheck-parallel update-unicode
 
 # make `all' the default target
 all:
@@ -351,6 +351,22 @@ XGETTEXT = @XGETTEXT@
 GZIP	= gzip
 BZIP2	= bzip2
 
+DOWNLOAD = wget -O $@ --no-use-server-timestamps
+#DOWNLOAD = curl -o $@
+
+
+# Unicode data information
+
+# Before each major release, update these and run make update-unicode.
+
+# Pick a release from here: <https://www.unicode.org/Public/>.  Note
+# that the most recent release listed there is often a pre-release;
+# don't pick that one, except for testing.
+UNICODE_VERSION = 12.1.0
+
+# Pick a release from here: <http://cldr.unicode.org/index/downloads>
+CLDR_VERSION = 34
+
 
 # Tree-wide build support
 
diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile
index 63710f9ea7..20c6849a65 100644
--- a/src/backend/utils/mb/Unicode/Makefile
+++ b/src/backend/utils/mb/Unicode/Makefile
@@ -115,9 +115,6 @@ maintainer-clean: distclean
 	rm -f $(MAPS)
 
 
-DOWNLOAD = wget -O $@ --no-use-server-timestamps
-#DOWNLOAD = curl -o $@
-
 BIG5.TXT CNS11643.TXT:
 	$(DOWNLOAD) https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/$(@F)
 
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index b2d598cbee..02e2588ffe 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -643,73 +643,7 @@ mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
 static int
 ucs_wcwidth(pg_wchar ucs)
 {
-	/* sorted list of non-overlapping intervals of non-spacing characters */
-	static const struct mbinterval combining[] = {
-		{0x0300, 0x036F}, {0x0483, 0x0489}, {0x0591, 0x05BD},
-		{0x05BF, 0x05BF}, {0x05C1, 0x05C2}, {0x05C4, 0x05C5},
-		{0x05C7, 0x05C7}, {0x0610, 0x061A}, {0x064B, 0x065F},
-		{0x0670, 0x0670}, {0x06D6, 0x06DC}, {0x06DF, 0x06E4},
-		{0x06E7, 0x06E8}, {0x06EA, 0x06ED}, {0x0711, 0x0711},
-		{0x0730, 0x074A}, {0x07A6, 0x07B0}, {0x07EB, 0x07F3},
-		{0x07FD, 0x07FD}, {0x0816, 0x0819}, {0x081B, 0x0823},
-		{0x0825, 0x0827}, {0x0829, 0x082D}, {0x0859, 0x085B},
-		{0x08D3, 0x08E1}, {0x08E3, 0x0902}, {0x093A, 0x093A},
-		{0x093C, 0x093C}, {0x0941, 0x0948}, {0x094D, 0x094D},
-		{0x0951, 0x0957}, {0x0962, 0x0963}, {0x0981, 0x0981},
-		{0x09BC, 0x09BC}, {0x09C1, 0x09C4}, {0x09CD, 0x09CD},
-		{0x09E2, 0x09E3}, {0x09FE, 0x0A02}, {0x0A3C, 0x0A3C},
-		{0x0A41, 0x0A51}, {0x0A70, 0x0A71}, {0x0A75, 0x0A75},
-		{0x0A81, 0x0A82}, {0x0ABC, 0x0ABC}, {0x0AC1, 0x0AC8},
-		{0x0ACD, 0x0ACD}, {0x0AE2, 0x0AE3}, {0x0AFA, 0x0B01},
-		{0x0B3C, 0x0B3C}, {0x0B3F, 0x0B3F}, {0x0B41, 0x0B44},
-		{0x0B4D, 0x0B56}, {0x0B62, 0x0B63}, {0x0B82, 0x0B82},
-		{0x0BC0, 0x0BC0}, {0x0BCD, 0x0BCD}, {0x0C00, 0x0C00},
-		{0x0C04, 0x0C04}, {0x0C3E, 0x0C40}, {0x0C46, 0x0C56},
-		{0x0C62, 0x0C63}, {0x0C81, 0x0C81}, {0x0CBC, 0x0CBC},
-		{0x0CBF, 0x0CBF}, {0x0CC6, 0x0CC6}, {0x0CCC, 0x0CCD},
-		{0x0CE2, 0x0CE3}, {0x0D00, 0x0D01}, {0x0D3B, 0x0D3C},
-		{0x0D41, 0x0D44}, {0x0D4D, 0x0D4D}, {0x0D62, 0x0D63},
-		{0x0DCA, 0x0DCA}, {0x0DD2, 0x0DD6}, {0x0E31, 0x0E31},
-		{0x0E34, 0x0E3A}, {0x0E47, 0x0E4E}, {0x0EB1, 0x0EB1},
-		{0x0EB4, 0x0EBC}, {0x0EC8, 0x0ECD}, {0x0F18, 0x0F19},
-		{0x0F35, 0x0F35}, {0x0F37, 0x0F37}, {0x0F39, 0x0F39},
-		{0x0F71, 0x0F7E}, {0x0F80, 0x0F84}, {0x0F86, 0x0F87},
-		{0x0F8D, 0x0FBC}, {0x0FC6, 0x0FC6}, {0x102D, 0x1030},
-		{0x1032, 0x1037}, {0x1039, 0x103A}, {0x103D, 0x103E},
-		{0x1058, 0x1059}, {0x105E, 0x1060}, {0x1071, 0x1074},
-		{0x1082, 0x1082}, {0x1085, 0x1086}, {0x108D, 0x108D},
-		{0x109D, 0x109D}, {0x135D, 0x135F}, {0x1712, 0x1714},
-		{0x1732, 0x1734}, {0x1752, 0x1753}, {0x1772, 0x1773},
-		{0x17B4, 0x17B5}, {0x17B7, 0x17BD}, {0x17C6, 0x17C6},
-		{0x17C9, 0x17D3}, {0x17DD, 0x17DD}, {0x180B, 0x180D},
-		{0x1885, 0x1886}, {0x18A9, 0x18A9}, {0x1920, 0x1922},
-		{0x1927, 0x1928}, {0x1932, 0x1932}, {0x1939, 0x193B},
-		{0x1A17, 0x1A18}, {0x1A1B, 0x1A1B}, {0x1A56, 0x1A56},
-		{0x1A58, 0x1A60}, {0x1A62, 0x1A62}, {0x1A65, 0x1A6C},
-		{0x1A73, 0x1A7F}, {0x1AB0, 0x1B03}, {0x1B34, 0x1B34},
-		{0x1B36, 0x1B3A}, {0x1B3C, 0x1B3C}, {0x1B42, 0x1B42},
-		{0x1B6B, 0x1B73}, {0x1B80, 0x1B81}, {0x1BA2, 0x1BA5},
-		{0x1BA8, 0x1BA9}, {0x1BAB, 0x1BAD}, {0x1BE6, 0x1BE6},
-		{0x1BE8, 0x1BE9}, {0x1BED, 0x1BED}, {0x1BEF, 0x1BF1},
-		{0x1C2C, 0x1C33}, {0x1C36, 0x1C37}, {0x1CD0, 0x1CD2},
-		{0x1CD4, 0x1CE0}, {0x1CE2, 0x1CE8}, {0x1CED, 0x1CED},
-		{0x1CF4, 0x1CF4}, {0x1CF8, 0x1CF9}, {0x1DC0, 0x1DFF},
-		{0x20D0, 0x20F0}, {0x2CEF, 0x2CF1}, {0x2D7F, 0x2D7F},
-		{0x2DE0, 0x2DFF}, {0x302A, 0x302D}, {0x3099, 0x309A},
-		{0xA66F, 0xA672}, {0xA674, 0xA67D}, {0xA69E, 0xA69F},
-		{0xA6F0, 0xA6F1}, {0xA802, 0xA802}, {0xA806, 0xA806},
-		{0xA80B, 0xA80B}, {0xA825, 0xA826}, {0xA8C4, 0xA8C5},
-		{0xA8E0, 0xA8F1}, {0xA8FF, 0xA8FF}, {0xA926, 0xA92D},
-		{0xA947, 0xA951}, {0xA980, 0xA982}, {0xA9B3, 0xA9B3},
-		{0xA9B6, 0xA9B9}, {0xA9BC, 0xA9BD}, {0xA9E5, 0xA9E5},
-		{0xAA29, 0xAA2E}, {0xAA31, 0xAA32}, {0xAA35, 0xAA36},
-		{0xAA43, 0xAA43}, {0xAA4C, 0xAA4C}, {0xAA7C, 0xAA7C},
-		{0xAAB0, 0xAAB0}, {0xAAB2, 0xAAB4}, {0xAAB7, 0xAAB8},
-		{0xAABE, 0xAABF}, {0xAAC1, 0xAAC1}, {0xAAEC, 0xAAED},
-		{0xAAF6, 0xAAF6}, {0xABE5, 0xABE5}, {0xABE8, 0xABE8},
-		{0xABED, 0xABED}, {0xFB1E, 0xFB1E}, {0xFE00, 0xFE0F},
-		{0xFE20, 0xFE2F},
-	};
+#include "common/unicode_combining_table.h"
 
 	/* test for 8-bit control characters */
 	if (ucs == 0)
diff --git a/src/common/unicode/.gitignore b/src/common/unicode/.gitignore
index 5e583e2ccc..67f62d1aca 100644
--- a/src/common/unicode/.gitignore
+++ b/src/common/unicode/.gitignore
@@ -1,7 +1,2 @@
 /norm_test
 /norm_test_table.h
-
-# Files downloaded from the Unicode Character Database
-/CompositionExclusions.txt
-/NormalizationTest.txt
-/UnicodeData.txt
diff --git a/src/common/unicode/Makefile b/src/common/unicode/Makefile
index 334859c984..532846ab58 100644
--- a/src/common/unicode/Makefile
+++ b/src/common/unicode/Makefile
@@ -18,18 +18,25 @@ LIBS += $(PTHREAD_LIBS)
 # By default, do nothing.
 all:
 
-DOWNLOAD = wget -O $@ --no-use-server-timestamps
+update-unicode:
+	$(MAKE) unicode_norm_table.h unicode_combining_table.h
+	$(MAKE) normalization-check
+	mv unicode_norm_table.h unicode_combining_table.h ../../../src/include/common/
 
 # These files are part of the Unicode Character Database. Download
-# them on demand.
-UnicodeData.txt CompositionExclusions.txt NormalizationTest.txt:
-	$(DOWNLOAD) https://www.unicode.org/Public/UNIDATA/$(@F)
+# them on demand.  The dependency on Makefile.global is for
+# UNICODE_VERSION.
+UnicodeData.txt CompositionExclusions.txt NormalizationTest.txt: $(top_builddir)/src/Makefile.global
+	$(DOWNLOAD) https://www.unicode.org/Public/$(UNICODE_VERSION)/ucd/$(@F)
 
 # Generation of conversion tables used for string normalization with
 # UTF-8 strings.
 unicode_norm_table.h: generate-unicode_norm_table.pl UnicodeData.txt CompositionExclusions.txt
 	$(PERL) generate-unicode_norm_table.pl
 
+unicode_combining_table.h: generate-unicode_combining_table.pl UnicodeData.txt
+	$(PERL) $^ >$@
+
 # Test suite
 normalization-check: norm_test
 	./norm_test
diff --git a/src/common/unicode/generate-unicode_combining_table.pl b/src/common/unicode/generate-unicode_combining_table.pl
new file mode 100644
index 0000000000..e468a5f8c9
--- /dev/null
+++ b/src/common/unicode/generate-unicode_combining_table.pl
@@ -0,0 +1,52 @@
+#!/usr/bin/perl
+#
+# Generate sorted list of non-overlapping intervals of non-spacing
+# characters, using Unicode data files as input.  Pass UnicodeData.txt
+# as argument.  The output is on stdout.
+#
+# Copyright (c) 2019, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+my $range_start = undef;
+my $codepoint;
+my $prev_codepoint;
+my $count = 0;
+
+print "/* generated by src/common/unicode/generate-unicode_combining_table.pl, do not edit */\n\n";
+
+print "static const struct mbinterval combining[] = {\n";
+
+foreach my $line (<ARGV>)
+{
+    chomp $line;
+    my @fields = split ';', $line;
+    $codepoint = hex $fields[0];
+
+    next if $codepoint > 0xFFFF;
+
+    if ($fields[2] eq 'Me' || $fields[2] eq 'Mn')
+    {
+        # combining character, save for start of range
+        if (!defined($range_start))
+        {
+            $range_start = $codepoint;
+        }
+    }
+    else
+    {
+        # not a combining character, print out previous range if any
+        if (defined($range_start))
+        {
+            printf "\t{0x%04X, 0x%04X},\n", $range_start, $prev_codepoint;
+            $range_start = undef;
+        }
+    }
+}
+continue
+{
+    $prev_codepoint = $codepoint;
+}
+
+print "};\n";
diff --git a/src/include/common/unicode_combining_table.h b/src/include/common/unicode_combining_table.h
new file mode 100644
index 0000000000..b4a8588238
--- /dev/null
+++ b/src/include/common/unicode_combining_table.h
@@ -0,0 +1,194 @@
+/* generated by src/common/unicode/generate-unicode_combining_table.pl, do not edit */
+
+static const struct mbinterval combining[] = {
+	{0x0300, 0x036F},
+	{0x0483, 0x0489},
+	{0x0591, 0x05BD},
+	{0x05BF, 0x05BF},
+	{0x05C1, 0x05C2},
+	{0x05C4, 0x05C5},
+	{0x05C7, 0x05C7},
+	{0x0610, 0x061A},
+	{0x064B, 0x065F},
+	{0x0670, 0x0670},
+	{0x06D6, 0x06DC},
+	{0x06DF, 0x06E4},
+	{0x06E7, 0x06E8},
+	{0x06EA, 0x06ED},
+	{0x0711, 0x0711},
+	{0x0730, 0x074A},
+	{0x07A6, 0x07B0},
+	{0x07EB, 0x07F3},
+	{0x07FD, 0x07FD},
+	{0x0816, 0x0819},
+	{0x081B, 0x0823},
+	{0x0825, 0x0827},
+	{0x0829, 0x082D},
+	{0x0859, 0x085B},
+	{0x08D3, 0x08E1},
+	{0x08E3, 0x0902},
+	{0x093A, 0x093A},
+	{0x093C, 0x093C},
+	{0x0941, 0x0948},
+	{0x094D, 0x094D},
+	{0x0951, 0x0957},
+	{0x0962, 0x0963},
+	{0x0981, 0x0981},
+	{0x09BC, 0x09BC},
+	{0x09C1, 0x09C4},
+	{0x09CD, 0x09CD},
+	{0x09E2, 0x09E3},
+	{0x09FE, 0x0A02},
+	{0x0A3C, 0x0A3C},
+	{0x0A41, 0x0A51},
+	{0x0A70, 0x0A71},
+	{0x0A75, 0x0A75},
+	{0x0A81, 0x0A82},
+	{0x0ABC, 0x0ABC},
+	{0x0AC1, 0x0AC8},
+	{0x0ACD, 0x0ACD},
+	{0x0AE2, 0x0AE3},
+	{0x0AFA, 0x0B01},
+	{0x0B3C, 0x0B3C},
+	{0x0B3F, 0x0B3F},
+	{0x0B41, 0x0B44},
+	{0x0B4D, 0x0B56},
+	{0x0B62, 0x0B63},
+	{0x0B82, 0x0B82},
+	{0x0BC0, 0x0BC0},
+	{0x0BCD, 0x0BCD},
+	{0x0C00, 0x0C00},
+	{0x0C04, 0x0C04},
+	{0x0C3E, 0x0C40},
+	{0x0C46, 0x0C56},
+	{0x0C62, 0x0C63},
+	{0x0C81, 0x0C81},
+	{0x0CBC, 0x0CBC},
+	{0x0CBF, 0x0CBF},
+	{0x0CC6, 0x0CC6},
+	{0x0CCC, 0x0CCD},
+	{0x0CE2, 0x0CE3},
+	{0x0D00, 0x0D01},
+	{0x0D3B, 0x0D3C},
+	{0x0D41, 0x0D44},
+	{0x0D4D, 0x0D4D},
+	{0x0D62, 0x0D63},
+	{0x0DCA, 0x0DCA},
+	{0x0DD2, 0x0DD6},
+	{0x0E31, 0x0E31},
+	{0x0E34, 0x0E3A},
+	{0x0E47, 0x0E4E},
+	{0x0EB1, 0x0EB1},
+	{0x0EB4, 0x0EBC},
+	{0x0EC8, 0x0ECD},
+	{0x0F18, 0x0F19},
+	{0x0F35, 0x0F35},
+	{0x0F37, 0x0F37},
+	{0x0F39, 0x0F39},
+	{0x0F71, 0x0F7E},
+	{0x0F80, 0x0F84},
+	{0x0F86, 0x0F87},
+	{0x0F8D, 0x0FBC},
+	{0x0FC6, 0x0FC6},
+	{0x102D, 0x1030},
+	{0x1032, 0x1037},
+	{0x1039, 0x103A},
+	{0x103D, 0x103E},
+	{0x1058, 0x1059},
+	{0x105E, 0x1060},
+	{0x1071, 0x1074},
+	{0x1082, 0x1082},
+	{0x1085, 0x1086},
+	{0x108D, 0x108D},
+	{0x109D, 0x109D},
+	{0x135D, 0x135F},
+	{0x1712, 0x1714},
+	{0x1732, 0x1734},
+	{0x1752, 0x1753},
+	{0x1772, 0x1773},
+	{0x17B4, 0x17B5},
+	{0x17B7, 0x17BD},
+	{0x17C6, 0x17C6},
+	{0x17C9, 0x17D3},
+	{0x17DD, 0x17DD},
+	{0x180B, 0x180D},
+	{0x1885, 0x1886},
+	{0x18A9, 0x18A9},
+	{0x1920, 0x1922},
+	{0x1927, 0x1928},
+	{0x1932, 0x1932},
+	{0x1939, 0x193B},
+	{0x1A17, 0x1A18},
+	{0x1A1B, 0x1A1B},
+	{0x1A56, 0x1A56},
+	{0x1A58, 0x1A60},
+	{0x1A62, 0x1A62},
+	{0x1A65, 0x1A6C},
+	{0x1A73, 0x1A7F},
+	{0x1AB0, 0x1B03},
+	{0x1B34, 0x1B34},
+	{0x1B36, 0x1B3A},
+	{0x1B3C, 0x1B3C},
+	{0x1B42, 0x1B42},
+	{0x1B6B, 0x1B73},
+	{0x1B80, 0x1B81},
+	{0x1BA2, 0x1BA5},
+	{0x1BA8, 0x1BA9},
+	{0x1BAB, 0x1BAD},
+	{0x1BE6, 0x1BE6},
+	{0x1BE8, 0x1BE9},
+	{0x1BED, 0x1BED},
+	{0x1BEF, 0x1BF1},
+	{0x1C2C, 0x1C33},
+	{0x1C36, 0x1C37},
+	{0x1CD0, 0x1CD2},
+	{0x1CD4, 0x1CE0},
+	{0x1CE2, 0x1CE8},
+	{0x1CED, 0x1CED},
+	{0x1CF4, 0x1CF4},
+	{0x1CF8, 0x1CF9},
+	{0x1DC0, 0x1DFF},
+	{0x20D0, 0x20F0},
+	{0x2CEF, 0x2CF1},
+	{0x2D7F, 0x2D7F},
+	{0x2DE0, 0x2DFF},
+	{0x302A, 0x302D},
+	{0x3099, 0x309A},
+	{0xA66F, 0xA672},
+	{0xA674, 0xA67D},
+	{0xA69E, 0xA69F},
+	{0xA6F0, 0xA6F1},
+	{0xA802, 0xA802},
+	{0xA806, 0xA806},
+	{0xA80B, 0xA80B},
+	{0xA825, 0xA826},
+	{0xA8C4, 0xA8C5},
+	{0xA8E0, 0xA8F1},
+	{0xA8FF, 0xA8FF},
+	{0xA926, 0xA92D},
+	{0xA947, 0xA951},
+	{0xA980, 0xA982},
+	{0xA9B3, 0xA9B3},
+	{0xA9B6, 0xA9B9},
+	{0xA9BC, 0xA9BD},
+	{0xA9E5, 0xA9E5},
+	{0xAA29, 0xAA2E},
+	{0xAA31, 0xAA32},
+	{0xAA35, 0xAA36},
+	{0xAA43, 0xAA43},
+	{0xAA4C, 0xAA4C},
+	{0xAA7C, 0xAA7C},
+	{0xAAB0, 0xAAB0},
+	{0xAAB2, 0xAAB4},
+	{0xAAB7, 0xAAB8},
+	{0xAABE, 0xAABF},
+	{0xAAC1, 0xAAC1},
+	{0xAAEC, 0xAAED},
+	{0xAAF6, 0xAAF6},
+	{0xABE5, 0xABE5},
+	{0xABE8, 0xABE8},
+	{0xABED, 0xABED},
+	{0xFB1E, 0xFB1E},
+	{0xFE00, 0xFE0F},
+	{0xFE20, 0xFE2F},
+};
diff --git a/src/tools/RELEASE_CHANGES b/src/tools/RELEASE_CHANGES
index 46139877ed..a7bff76b76 100644
--- a/src/tools/RELEASE_CHANGES
+++ b/src/tools/RELEASE_CHANGES
@@ -77,6 +77,9 @@ but there may be reasons to do them at other times as well.
 
 * Update inet/cidr data types with newest Bind patches
 
+* Update Unicode data: Edit UNICODE_VERSION and CLDR_VERSION in
+  src/Makefile.global.in, run make update-unicode, and commit.
+
 
 Starting a New Development Cycle
 ================================
-- 
2.23.0

John Naylor

john.naylor@2ndquadrant.com

about 6 years ago

In reply to: Peter Eisentraut (#1)

Re: Add support for automatically updating Unicode derived files

On Tue, Oct 29, 2019 at 6:06 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

> Continuing the discussion from [0] and [1], here is a patch that
> automates the process of updating Unicode derived files. Summary:
>
> - Edit UNICODE_VERSION and/or CLDR_VERSION in src/Makefile.global.in
> - Run make update-unicode
> - Commit

Hi Peter,

I gave "make update-unicode" a try. It's unclear to me what the state
of the build tree should be when a maintainer runs this, so I'll just
report what happens when running naively (on MacOS).

After only running configure, "make update-unicode" gives this error
at normalization-check:

ld: library not found for -lpgcommon
clang: error: linker command failed with exit code 1 (use -v to see invocation)

After commenting that out, the next command "$(MAKE) -C
contrib/unaccent $@" failed, seemingly because $(PYTHON) is empty
unless --with-python was specified at configure time.
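
To make that second failure concrete, here is a rough Python sketch of the command line the unaccent.rules rule expands to. The option names and default paths are taken from the patch's Makefile rule; the helper name and the fallback to the running interpreter are hypothetical, added only for illustration, and are not something the patch does.

```python
# Sketch only: shows why an empty $(PYTHON) breaks the unaccent.rules rule.
# unaccent_rules_command() is a hypothetical helper, not part of the patch.
import sys


def unaccent_rules_command(
    python,
    script="generate_unaccent_rules.py",
    unicode_data="../../src/common/unicode/UnicodeData.txt",
    latin_ascii="Latin-ASCII.xml",
):
    """Build the argv that the Makefile rule would run.

    With an empty interpreter the rule would try to execute the script's
    arguments without any program in front; here we fall back to the
    interpreter running this sketch (an assumption made for illustration).
    """
    interpreter = python.strip() or sys.executable
    if not interpreter:
        raise RuntimeError("no Python interpreter available")
    return [
        interpreter,
        script,
        "--unicode-data-file", unicode_data,
        "--latin-ascii-file", latin_ascii,
    ]


if __name__ == "__main__":
    # An empty string stands in for $(PYTHON) when configure did not set it.
    print(" ".join(unaccent_rules_command("")))
```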

> Open questions that are currently not handled consistently:
>
> - Should the downloaded files be listed in .gitignore?

These files are transient byproducts of a build, and we don't want
them committed, so they seem like a normal candidate for .gitignore.

> - Should the downloaded files be cleaned by make clean (or distclean or
> maintainer-clean or none)?

It seems one would want to make clean without removing these files,
and maintainer clean is for removing things that are preserved in
distribution tarballs. So I would go with distclean.

> - Should the generated files be excluded from pgindent? Currently, the
> generated files will not pass pgindent unchanged, so that could cause
> annoying whitespace battles when these files are updated and re-indented
> around release time.

I see what you mean in the norm table header. I think generated files
should not be pgindent'd, since creating them is already a consistent,
mechanical process, and their presentation is not as important as
other code.

Other comments:

+print "/* generated by src/common/unicode/generate-unicode_combining_table.pl, do not edit */\n\n";

I would print out the full boilerplate like for other generated headers.

Lastly, src/common/unicode/README is outdated (and possibly no longer
useful at all?).

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

about 6 years ago

In reply to: John Naylor (#2)

1 attachment(s)

Re: Add support for automatically updating Unicode derived files

On 2019-12-19 23:48, John Naylor wrote:

> I gave "make update-unicode" a try. It's unclear to me what the state
> of the build tree should be when a maintainer runs this, so I'll just
> report what happens when running naively (on MacOS).

Yeah, that wasn't fully thought through, it appears.

> After only running configure, "make update-unicode" gives this error
> at normalization-check:
>
> ld: library not found for -lpgcommon
> clang: error: linker command failed with exit code 1 (use -v to see invocation)

Fixed by adding more make dependencies.

> After commenting that out, the next command "$(MAKE) -C
> contrib/unaccent $@" failed, seemingly because $(PYTHON) is empty
> unless --with-python was specified at configure time.

I'm not sure whether that's worth addressing.

>> Open questions that are currently not handled consistently:
>>
>> - Should the downloaded files be listed in .gitignore?
>
> These files are transient byproducts of a build, and we don't want
> them committed, so they seem like a normal candidate for .gitignore.

OK done

>> - Should the downloaded files be cleaned by make clean (or distclean or
>> maintainer-clean or none)?
>
> It seems one would want to make clean without removing these files,
> and maintainer clean is for removing things that are preserved in
> distribution tarballs. So I would go with distclean.

also done

>> - Should the generated files be excluded from pgindent? Currently, the
>> generated files will not pass pgindent unchanged, so that could cause
>> annoying whitespace battles when these files are updated and re-indented
>> around release time.
>
> I see what you mean in the norm table header. I think generated files
> should not be pgindent'd, since creating them is already a consistent,
> mechanical process, and their presentation is not as important as
> other code.

I've left it alone for now because the little indentation problem
currently present might actually go away with my Unicode normalization
support patch.

> Other comments:
>
> +print "/* generated by src/common/unicode/generate-unicode_combining_table.pl, do not edit */\n\n";
>
> I would print out the full boilerplate like for other generated headers.

Hmm, you are probably comparing with
src/common/unicode/generate-unicode_norm_table.pl, but other file
generating scripts around the tree print out a small header in the style
that I have. I'd rather adjust the output of
generate-unicode_norm_table.pl to match those. (It's also not quite
correct to make copyright claims about automatically generated output.)

> Lastly, src/common/unicode/README is outdated (and possibly no longer
> useful at all?).

updated

new patch attached

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v2-0001-Add-support-for-automatically-updating-Unicode-de.patch (text/plain; charset=UTF-8)

From cd55990f1846b0d3cfaf0fe2fd92d5a3fd792dd6 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 26 Dec 2019 12:33:36 +0100
Subject: [PATCH v2] Add support for automatically updating Unicode derived
 files

We currently have several sets of files generated from data provided
by Unicode.  These all have ad hoc rules and instructions for updating
when new Unicode versions appear, and it's not done consistently.

This patch centralizes and automates the process and makes it part of
the release checklist.  The Unicode and CLDR versions are specified in
Makefile.global.in.  There is a new make target "update-unicode" that
downloads all the relevant files and runs the generation script.

There is also a new script for generating the table of combining
characters for ucs_wcwidth().  That table is now in a separate include
file rather than hardcoded into the middle of other code.  This is
based on the script that was used for generating
d8594d123c155aeecd47fc2450f62f5100b2fbf0, but the script itself wasn't
committed at that time.

Discussion: https://www.postgresql.org/message-id/flat/c8d05f42-443e-6c23-819b-05b31759a37c@2ndquadrant.com
---
 GNUmakefile.in                                |   4 +
 contrib/unaccent/.gitignore                   |   3 +
 contrib/unaccent/Makefile                     |  16 ++
 contrib/unaccent/generate_unaccent_rules.py   |  10 +-
 src/Makefile.global.in                        |  18 +-
 src/backend/utils/mb/Unicode/Makefile         |   3 -
 src/backend/utils/mb/wchar.c                  |  68 +-----
 src/common/unicode/.gitignore                 |   2 +-
 src/common/unicode/Makefile                   |  14 +-
 src/common/unicode/README                     |  17 +-
 .../generate-unicode_combining_table.pl       |  52 +++++
 src/include/common/unicode_combining_table.h  | 194 ++++++++++++++++++
 src/tools/RELEASE_CHANGES                     |   3 +
 13 files changed, 310 insertions(+), 94 deletions(-)
 create mode 100644 src/common/unicode/generate-unicode_combining_table.pl
 create mode 100644 src/include/common/unicode_combining_table.h

diff --git a/GNUmakefile.in b/GNUmakefile.in
index 9dc373c79c..ee636e3b50 100644
--- a/GNUmakefile.in
+++ b/GNUmakefile.in
@@ -75,6 +75,10 @@ $(call recurse,installcheck-world,src/test src/pl src/interfaces/ecpg contrib sr
 GNUmakefile: GNUmakefile.in $(top_builddir)/config.status
 	./config.status $@
 
+update-unicode: | submake-generated-headers submake-libpgport
+	$(MAKE) -C src/common/unicode $@
+	$(MAKE) -C contrib/unaccent $@
+
 
 ##########################################################################
 
diff --git a/contrib/unaccent/.gitignore b/contrib/unaccent/.gitignore
index 5dcb3ff972..bccda7317d 100644
--- a/contrib/unaccent/.gitignore
+++ b/contrib/unaccent/.gitignore
@@ -2,3 +2,6 @@
 /log/
 /results/
 /tmp_check/
+
+# Downloaded files
+/Latin-ASCII.xml
diff --git a/contrib/unaccent/Makefile b/contrib/unaccent/Makefile
index 92b7f9d78e..0f40f89c2b 100644
--- a/contrib/unaccent/Makefile
+++ b/contrib/unaccent/Makefile
@@ -26,3 +26,19 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+update-unicode: unaccent.rules
+
+unaccent.rules: generate_unaccent_rules.py ../../src/common/unicode/UnicodeData.txt Latin-ASCII.xml
+	$(PYTHON) $< --unicode-data-file $(word 2,$^) --latin-ascii-file $(word 3,$^) >$@
+
+# only download it once
+../../src/common/unicode/UnicodeData.txt:
+	$(MAKE) -C $(@D) $(@F)
+
+# dependency on Makefile.global is for CLDR_VERSION
+Latin-ASCII.xml: $(top_builddir)/src/Makefile.global
+	$(DOWNLOAD) https://raw.githubusercontent.com/unicode-org/cldr/release-$(subst .,-,$(CLDR_VERSION))/common/transforms/Latin-ASCII.xml
+
+distclean:
+	rm -f Latin-ASCII.xml
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index acfb4f0b68..a952de510c 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -20,13 +20,11 @@
 # option is enabled, the XML file of this transliterator [2] -- given as a
 # command line argument -- will be parsed and used.
 #
-# Ideally you should use the latest release for each data set.  For
-# Latin-ASCII.xml, the latest data sets released can be browsed directly
-# via [3].  Note that this script is compatible with at least release 29.
+# Ideally you should use the latest release for each data set.  This
+# script is compatible with at least CLDR release 29.
 #
-# [1] https://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt
-# [2] https://raw.githubusercontent.com/unicode-org/cldr/release-34/common/transforms/Latin-ASCII.xml
-# [3] https://github.com/unicode-org/cldr/tags
+# [1] https://www.unicode.org/Public/${UNICODE_VERSION}/ucd/UnicodeData.txt
+# [2] https://raw.githubusercontent.com/unicode-org/cldr/${TAG}/common/transforms/Latin-ASCII.xml
 
 # BEGIN: Python 2/3 compatibility - remove when Python 2 compatibility dropped
 # The approach is to be Python3 compatible with Python2 "backports".
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 05b66380e0..1e2f4ee405 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -23,7 +23,7 @@ standard_targets = all install installdirs uninstall distprep clean distclean ma
 # these targets should recurse even into subdirectories not being built:
 standard_always_targets = distprep clean distclean maintainer-clean
 
-.PHONY: $(standard_targets) install-strip html man installcheck-parallel
+.PHONY: $(standard_targets) install-strip html man installcheck-parallel update-unicode
 
 # make `all' the default target
 all:
@@ -351,6 +351,22 @@ XGETTEXT = @XGETTEXT@
 GZIP	= gzip
 BZIP2	= bzip2
 
+DOWNLOAD = wget -O $@ --no-use-server-timestamps
+#DOWNLOAD = curl -o $@
+
+
+# Unicode data information
+
+# Before each major release, update these and run make update-unicode.
+
+# Pick a release from here: <https://www.unicode.org/Public/>.  Note
+# that the most recent release listed there is often a pre-release;
+# don't pick that one, except for testing.
+UNICODE_VERSION = 12.1.0
+
+# Pick a release from here: <http://cldr.unicode.org/index/downloads>
+CLDR_VERSION = 34
+
 
 # Tree-wide build support
 
diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile
index 63710f9ea7..20c6849a65 100644
--- a/src/backend/utils/mb/Unicode/Makefile
+++ b/src/backend/utils/mb/Unicode/Makefile
@@ -115,9 +115,6 @@ maintainer-clean: distclean
 	rm -f $(MAPS)
 
 
-DOWNLOAD = wget -O $@ --no-use-server-timestamps
-#DOWNLOAD = curl -o $@
-
 BIG5.TXT CNS11643.TXT:
 	$(DOWNLOAD) https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/$(@F)
 
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index b2d598cbee..02e2588ffe 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -643,73 +643,7 @@ mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
 static int
 ucs_wcwidth(pg_wchar ucs)
 {
-	/* sorted list of non-overlapping intervals of non-spacing characters */
-	static const struct mbinterval combining[] = {
-		{0x0300, 0x036F}, {0x0483, 0x0489}, {0x0591, 0x05BD},
-		{0x05BF, 0x05BF}, {0x05C1, 0x05C2}, {0x05C4, 0x05C5},
-		{0x05C7, 0x05C7}, {0x0610, 0x061A}, {0x064B, 0x065F},
-		{0x0670, 0x0670}, {0x06D6, 0x06DC}, {0x06DF, 0x06E4},
-		{0x06E7, 0x06E8}, {0x06EA, 0x06ED}, {0x0711, 0x0711},
-		{0x0730, 0x074A}, {0x07A6, 0x07B0}, {0x07EB, 0x07F3},
-		{0x07FD, 0x07FD}, {0x0816, 0x0819}, {0x081B, 0x0823},
-		{0x0825, 0x0827}, {0x0829, 0x082D}, {0x0859, 0x085B},
-		{0x08D3, 0x08E1}, {0x08E3, 0x0902}, {0x093A, 0x093A},
-		{0x093C, 0x093C}, {0x0941, 0x0948}, {0x094D, 0x094D},
-		{0x0951, 0x0957}, {0x0962, 0x0963}, {0x0981, 0x0981},
-		{0x09BC, 0x09BC}, {0x09C1, 0x09C4}, {0x09CD, 0x09CD},
-		{0x09E2, 0x09E3}, {0x09FE, 0x0A02}, {0x0A3C, 0x0A3C},
-		{0x0A41, 0x0A51}, {0x0A70, 0x0A71}, {0x0A75, 0x0A75},
-		{0x0A81, 0x0A82}, {0x0ABC, 0x0ABC}, {0x0AC1, 0x0AC8},
-		{0x0ACD, 0x0ACD}, {0x0AE2, 0x0AE3}, {0x0AFA, 0x0B01},
-		{0x0B3C, 0x0B3C}, {0x0B3F, 0x0B3F}, {0x0B41, 0x0B44},
-		{0x0B4D, 0x0B56}, {0x0B62, 0x0B63}, {0x0B82, 0x0B82},
-		{0x0BC0, 0x0BC0}, {0x0BCD, 0x0BCD}, {0x0C00, 0x0C00},
-		{0x0C04, 0x0C04}, {0x0C3E, 0x0C40}, {0x0C46, 0x0C56},
-		{0x0C62, 0x0C63}, {0x0C81, 0x0C81}, {0x0CBC, 0x0CBC},
-		{0x0CBF, 0x0CBF}, {0x0CC6, 0x0CC6}, {0x0CCC, 0x0CCD},
-		{0x0CE2, 0x0CE3}, {0x0D00, 0x0D01}, {0x0D3B, 0x0D3C},
-		{0x0D41, 0x0D44}, {0x0D4D, 0x0D4D}, {0x0D62, 0x0D63},
-		{0x0DCA, 0x0DCA}, {0x0DD2, 0x0DD6}, {0x0E31, 0x0E31},
-		{0x0E34, 0x0E3A}, {0x0E47, 0x0E4E}, {0x0EB1, 0x0EB1},
-		{0x0EB4, 0x0EBC}, {0x0EC8, 0x0ECD}, {0x0F18, 0x0F19},
-		{0x0F35, 0x0F35}, {0x0F37, 0x0F37}, {0x0F39, 0x0F39},
-		{0x0F71, 0x0F7E}, {0x0F80, 0x0F84}, {0x0F86, 0x0F87},
-		{0x0F8D, 0x0FBC}, {0x0FC6, 0x0FC6}, {0x102D, 0x1030},
-		{0x1032, 0x1037}, {0x1039, 0x103A}, {0x103D, 0x103E},
-		{0x1058, 0x1059}, {0x105E, 0x1060}, {0x1071, 0x1074},
-		{0x1082, 0x1082}, {0x1085, 0x1086}, {0x108D, 0x108D},
-		{0x109D, 0x109D}, {0x135D, 0x135F}, {0x1712, 0x1714},
-		{0x1732, 0x1734}, {0x1752, 0x1753}, {0x1772, 0x1773},
-		{0x17B4, 0x17B5}, {0x17B7, 0x17BD}, {0x17C6, 0x17C6},
-		{0x17C9, 0x17D3}, {0x17DD, 0x17DD}, {0x180B, 0x180D},
-		{0x1885, 0x1886}, {0x18A9, 0x18A9}, {0x1920, 0x1922},
-		{0x1927, 0x1928}, {0x1932, 0x1932}, {0x1939, 0x193B},
-		{0x1A17, 0x1A18}, {0x1A1B, 0x1A1B}, {0x1A56, 0x1A56},
-		{0x1A58, 0x1A60}, {0x1A62, 0x1A62}, {0x1A65, 0x1A6C},
-		{0x1A73, 0x1A7F}, {0x1AB0, 0x1B03}, {0x1B34, 0x1B34},
-		{0x1B36, 0x1B3A}, {0x1B3C, 0x1B3C}, {0x1B42, 0x1B42},
-		{0x1B6B, 0x1B73}, {0x1B80, 0x1B81}, {0x1BA2, 0x1BA5},
-		{0x1BA8, 0x1BA9}, {0x1BAB, 0x1BAD}, {0x1BE6, 0x1BE6},
-		{0x1BE8, 0x1BE9}, {0x1BED, 0x1BED}, {0x1BEF, 0x1BF1},
-		{0x1C2C, 0x1C33}, {0x1C36, 0x1C37}, {0x1CD0, 0x1CD2},
-		{0x1CD4, 0x1CE0}, {0x1CE2, 0x1CE8}, {0x1CED, 0x1CED},
-		{0x1CF4, 0x1CF4}, {0x1CF8, 0x1CF9}, {0x1DC0, 0x1DFF},
-		{0x20D0, 0x20F0}, {0x2CEF, 0x2CF1}, {0x2D7F, 0x2D7F},
-		{0x2DE0, 0x2DFF}, {0x302A, 0x302D}, {0x3099, 0x309A},
-		{0xA66F, 0xA672}, {0xA674, 0xA67D}, {0xA69E, 0xA69F},
-		{0xA6F0, 0xA6F1}, {0xA802, 0xA802}, {0xA806, 0xA806},
-		{0xA80B, 0xA80B}, {0xA825, 0xA826}, {0xA8C4, 0xA8C5},
-		{0xA8E0, 0xA8F1}, {0xA8FF, 0xA8FF}, {0xA926, 0xA92D},
-		{0xA947, 0xA951}, {0xA980, 0xA982}, {0xA9B3, 0xA9B3},
-		{0xA9B6, 0xA9B9}, {0xA9BC, 0xA9BD}, {0xA9E5, 0xA9E5},
-		{0xAA29, 0xAA2E}, {0xAA31, 0xAA32}, {0xAA35, 0xAA36},
-		{0xAA43, 0xAA43}, {0xAA4C, 0xAA4C}, {0xAA7C, 0xAA7C},
-		{0xAAB0, 0xAAB0}, {0xAAB2, 0xAAB4}, {0xAAB7, 0xAAB8},
-		{0xAABE, 0xAABF}, {0xAAC1, 0xAAC1}, {0xAAEC, 0xAAED},
-		{0xAAF6, 0xAAF6}, {0xABE5, 0xABE5}, {0xABE8, 0xABE8},
-		{0xABED, 0xABED}, {0xFB1E, 0xFB1E}, {0xFE00, 0xFE0F},
-		{0xFE20, 0xFE2F},
-	};
+#include "common/unicode_combining_table.h"
 
 	/* test for 8-bit control characters */
 	if (ucs == 0)
diff --git a/src/common/unicode/.gitignore b/src/common/unicode/.gitignore
index 5e583e2ccc..b5a4d84274 100644
--- a/src/common/unicode/.gitignore
+++ b/src/common/unicode/.gitignore
@@ -1,7 +1,7 @@
 /norm_test
 /norm_test_table.h
 
-# Files downloaded from the Unicode Character Database
+# Downloaded files
 /CompositionExclusions.txt
 /NormalizationTest.txt
 /UnicodeData.txt
diff --git a/src/common/unicode/Makefile b/src/common/unicode/Makefile
index 334859c984..ec78aeec2a 100644
--- a/src/common/unicode/Makefile
+++ b/src/common/unicode/Makefile
@@ -18,18 +18,24 @@ LIBS += $(PTHREAD_LIBS)
 # By default, do nothing.
 all:
 
-DOWNLOAD = wget -O $@ --no-use-server-timestamps
+update-unicode: unicode_norm_table.h unicode_combining_table.h
+	$(MAKE) normalization-check
+	mv unicode_norm_table.h unicode_combining_table.h ../../../src/include/common/
 
 # These files are part of the Unicode Character Database. Download
-# them on demand.
-UnicodeData.txt CompositionExclusions.txt NormalizationTest.txt:
-	$(DOWNLOAD) https://www.unicode.org/Public/UNIDATA/$(@F)
+# them on demand.  The dependency on Makefile.global is for
+# UNICODE_VERSION.
+UnicodeData.txt CompositionExclusions.txt NormalizationTest.txt: $(top_builddir)/src/Makefile.global
+	$(DOWNLOAD) https://www.unicode.org/Public/$(UNICODE_VERSION)/ucd/$(@F)
 
 # Generation of conversion tables used for string normalization with
 # UTF-8 strings.
 unicode_norm_table.h: generate-unicode_norm_table.pl UnicodeData.txt CompositionExclusions.txt
 	$(PERL) generate-unicode_norm_table.pl
 
+unicode_combining_table.h: generate-unicode_combining_table.pl UnicodeData.txt
+	$(PERL) $^ >$@
+
 # Test suite
 normalization-check: norm_test
 	./norm_test
diff --git a/src/common/unicode/README b/src/common/unicode/README
index 5aa79044d3..56956f6a65 100644
--- a/src/common/unicode/README
+++ b/src/common/unicode/README
@@ -8,20 +8,11 @@ of Unicode.
 Generating unicode_norm_table.h
 -------------------------------
 
-1. Download the Unicode data file, UnicodeData.txt, from the Unicode
-consortium and place it to the current directory. Run the perl script
-"generate-unicode_norm_table.pl", to process it, and to generate the
-"unicode_norm_table.h" file. The Makefile contains a rule to download the
-data files if they don't exist.
-
-    make unicode_norm_table.h
-
-2. Inspect the resulting header file. Once you're happy with it, copy it to
-the right location.
-
-    cp unicode_norm_table.h ../../../src/include/common/
+Run
 
+    make update-unicode
 
+from the top level of the source tree and commit the result.
 
 Tests
 -----
@@ -33,3 +24,5 @@ normalization code with all the test strings in NormalizationTest.txt.
 To download NormalizationTest.txt and run the tests:
 
     make normalization-check
+
+This is also run as part of the update-unicode target.
diff --git a/src/common/unicode/generate-unicode_combining_table.pl b/src/common/unicode/generate-unicode_combining_table.pl
new file mode 100644
index 0000000000..e468a5f8c9
--- /dev/null
+++ b/src/common/unicode/generate-unicode_combining_table.pl
@@ -0,0 +1,52 @@
+#!/usr/bin/perl
+#
+# Generate sorted list of non-overlapping intervals of non-spacing
+# characters, using Unicode data files as input.  Pass UnicodeData.txt
+# as argument.  The output is on stdout.
+#
+# Copyright (c) 2019, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+my $range_start = undef;
+my $codepoint;
+my $prev_codepoint;
+my $count = 0;
+
+print "/* generated by src/common/unicode/generate-unicode_combining_table.pl, do not edit */\n\n";
+
+print "static const struct mbinterval combining[] = {\n";
+
+foreach my $line (<ARGV>)
+{
+    chomp $line;
+    my @fields = split ';', $line;
+    $codepoint = hex $fields[0];
+
+    next if $codepoint > 0xFFFF;
+
+    if ($fields[2] eq 'Me' || $fields[2] eq 'Mn')
+    {
+        # combining character, save for start of range
+        if (!defined($range_start))
+        {
+            $range_start = $codepoint;
+        }
+    }
+    else
+    {
+        # not a combining character, print out previous range if any
+        if (defined($range_start))
+        {
+            printf "\t{0x%04X, 0x%04X},\n", $range_start, $prev_codepoint;
+            $range_start = undef;
+        }
+    }
+}
+continue
+{
+    $prev_codepoint = $codepoint;
+}
+
+print "};\n";
diff --git a/src/include/common/unicode_combining_table.h b/src/include/common/unicode_combining_table.h
new file mode 100644
index 0000000000..b4a8588238
--- /dev/null
+++ b/src/include/common/unicode_combining_table.h
@@ -0,0 +1,194 @@
+/* generated by src/common/unicode/generate-unicode_combining_table.pl, do not edit */
+
+static const struct mbinterval combining[] = {
+	{0x0300, 0x036F},
+	{0x0483, 0x0489},
+	{0x0591, 0x05BD},
+	{0x05BF, 0x05BF},
+	{0x05C1, 0x05C2},
+	{0x05C4, 0x05C5},
+	{0x05C7, 0x05C7},
+	{0x0610, 0x061A},
+	{0x064B, 0x065F},
+	{0x0670, 0x0670},
+	{0x06D6, 0x06DC},
+	{0x06DF, 0x06E4},
+	{0x06E7, 0x06E8},
+	{0x06EA, 0x06ED},
+	{0x0711, 0x0711},
+	{0x0730, 0x074A},
+	{0x07A6, 0x07B0},
+	{0x07EB, 0x07F3},
+	{0x07FD, 0x07FD},
+	{0x0816, 0x0819},
+	{0x081B, 0x0823},
+	{0x0825, 0x0827},
+	{0x0829, 0x082D},
+	{0x0859, 0x085B},
+	{0x08D3, 0x08E1},
+	{0x08E3, 0x0902},
+	{0x093A, 0x093A},
+	{0x093C, 0x093C},
+	{0x0941, 0x0948},
+	{0x094D, 0x094D},
+	{0x0951, 0x0957},
+	{0x0962, 0x0963},
+	{0x0981, 0x0981},
+	{0x09BC, 0x09BC},
+	{0x09C1, 0x09C4},
+	{0x09CD, 0x09CD},
+	{0x09E2, 0x09E3},
+	{0x09FE, 0x0A02},
+	{0x0A3C, 0x0A3C},
+	{0x0A41, 0x0A51},
+	{0x0A70, 0x0A71},
+	{0x0A75, 0x0A75},
+	{0x0A81, 0x0A82},
+	{0x0ABC, 0x0ABC},
+	{0x0AC1, 0x0AC8},
+	{0x0ACD, 0x0ACD},
+	{0x0AE2, 0x0AE3},
+	{0x0AFA, 0x0B01},
+	{0x0B3C, 0x0B3C},
+	{0x0B3F, 0x0B3F},
+	{0x0B41, 0x0B44},
+	{0x0B4D, 0x0B56},
+	{0x0B62, 0x0B63},
+	{0x0B82, 0x0B82},
+	{0x0BC0, 0x0BC0},
+	{0x0BCD, 0x0BCD},
+	{0x0C00, 0x0C00},
+	{0x0C04, 0x0C04},
+	{0x0C3E, 0x0C40},
+	{0x0C46, 0x0C56},
+	{0x0C62, 0x0C63},
+	{0x0C81, 0x0C81},
+	{0x0CBC, 0x0CBC},
+	{0x0CBF, 0x0CBF},
+	{0x0CC6, 0x0CC6},
+	{0x0CCC, 0x0CCD},
+	{0x0CE2, 0x0CE3},
+	{0x0D00, 0x0D01},
+	{0x0D3B, 0x0D3C},
+	{0x0D41, 0x0D44},
+	{0x0D4D, 0x0D4D},
+	{0x0D62, 0x0D63},
+	{0x0DCA, 0x0DCA},
+	{0x0DD2, 0x0DD6},
+	{0x0E31, 0x0E31},
+	{0x0E34, 0x0E3A},
+	{0x0E47, 0x0E4E},
+	{0x0EB1, 0x0EB1},
+	{0x0EB4, 0x0EBC},
+	{0x0EC8, 0x0ECD},
+	{0x0F18, 0x0F19},
+	{0x0F35, 0x0F35},
+	{0x0F37, 0x0F37},
+	{0x0F39, 0x0F39},
+	{0x0F71, 0x0F7E},
+	{0x0F80, 0x0F84},
+	{0x0F86, 0x0F87},
+	{0x0F8D, 0x0FBC},
+	{0x0FC6, 0x0FC6},
+	{0x102D, 0x1030},
+	{0x1032, 0x1037},
+	{0x1039, 0x103A},
+	{0x103D, 0x103E},
+	{0x1058, 0x1059},
+	{0x105E, 0x1060},
+	{0x1071, 0x1074},
+	{0x1082, 0x1082},
+	{0x1085, 0x1086},
+	{0x108D, 0x108D},
+	{0x109D, 0x109D},
+	{0x135D, 0x135F},
+	{0x1712, 0x1714},
+	{0x1732, 0x1734},
+	{0x1752, 0x1753},
+	{0x1772, 0x1773},
+	{0x17B4, 0x17B5},
+	{0x17B7, 0x17BD},
+	{0x17C6, 0x17C6},
+	{0x17C9, 0x17D3},
+	{0x17DD, 0x17DD},
+	{0x180B, 0x180D},
+	{0x1885, 0x1886},
+	{0x18A9, 0x18A9},
+	{0x1920, 0x1922},
+	{0x1927, 0x1928},
+	{0x1932, 0x1932},
+	{0x1939, 0x193B},
+	{0x1A17, 0x1A18},
+	{0x1A1B, 0x1A1B},
+	{0x1A56, 0x1A56},
+	{0x1A58, 0x1A60},
+	{0x1A62, 0x1A62},
+	{0x1A65, 0x1A6C},
+	{0x1A73, 0x1A7F},
+	{0x1AB0, 0x1B03},
+	{0x1B34, 0x1B34},
+	{0x1B36, 0x1B3A},
+	{0x1B3C, 0x1B3C},
+	{0x1B42, 0x1B42},
+	{0x1B6B, 0x1B73},
+	{0x1B80, 0x1B81},
+	{0x1BA2, 0x1BA5},
+	{0x1BA8, 0x1BA9},
+	{0x1BAB, 0x1BAD},
+	{0x1BE6, 0x1BE6},
+	{0x1BE8, 0x1BE9},
+	{0x1BED, 0x1BED},
+	{0x1BEF, 0x1BF1},
+	{0x1C2C, 0x1C33},
+	{0x1C36, 0x1C37},
+	{0x1CD0, 0x1CD2},
+	{0x1CD4, 0x1CE0},
+	{0x1CE2, 0x1CE8},
+	{0x1CED, 0x1CED},
+	{0x1CF4, 0x1CF4},
+	{0x1CF8, 0x1CF9},
+	{0x1DC0, 0x1DFF},
+	{0x20D0, 0x20F0},
+	{0x2CEF, 0x2CF1},
+	{0x2D7F, 0x2D7F},
+	{0x2DE0, 0x2DFF},
+	{0x302A, 0x302D},
+	{0x3099, 0x309A},
+	{0xA66F, 0xA672},
+	{0xA674, 0xA67D},
+	{0xA69E, 0xA69F},
+	{0xA6F0, 0xA6F1},
+	{0xA802, 0xA802},
+	{0xA806, 0xA806},
+	{0xA80B, 0xA80B},
+	{0xA825, 0xA826},
+	{0xA8C4, 0xA8C5},
+	{0xA8E0, 0xA8F1},
+	{0xA8FF, 0xA8FF},
+	{0xA926, 0xA92D},
+	{0xA947, 0xA951},
+	{0xA980, 0xA982},
+	{0xA9B3, 0xA9B3},
+	{0xA9B6, 0xA9B9},
+	{0xA9BC, 0xA9BD},
+	{0xA9E5, 0xA9E5},
+	{0xAA29, 0xAA2E},
+	{0xAA31, 0xAA32},
+	{0xAA35, 0xAA36},
+	{0xAA43, 0xAA43},
+	{0xAA4C, 0xAA4C},
+	{0xAA7C, 0xAA7C},
+	{0xAAB0, 0xAAB0},
+	{0xAAB2, 0xAAB4},
+	{0xAAB7, 0xAAB8},
+	{0xAABE, 0xAABF},
+	{0xAAC1, 0xAAC1},
+	{0xAAEC, 0xAAED},
+	{0xAAF6, 0xAAF6},
+	{0xABE5, 0xABE5},
+	{0xABE8, 0xABE8},
+	{0xABED, 0xABED},
+	{0xFB1E, 0xFB1E},
+	{0xFE00, 0xFE0F},
+	{0xFE20, 0xFE2F},
+};
diff --git a/src/tools/RELEASE_CHANGES b/src/tools/RELEASE_CHANGES
index 46139877ed..a7bff76b76 100644
--- a/src/tools/RELEASE_CHANGES
+++ b/src/tools/RELEASE_CHANGES
@@ -77,6 +77,9 @@ but there may be reasons to do them at other times as well.
 
 * Update inet/cidr data types with newest Bind patches
 
+* Update Unicode data: Edit UNICODE_VERSION and CLDR_VERSION in
+  src/Makefile.global.in, run make update-unicode, and commit.
+
 
 Starting a New Development Cycle
 ================================
-- 
2.24.1
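
As a reading aid for generate-unicode_combining_table.pl in the patch above, here is a rough Python paraphrase of its loop. It is only a sketch, not part of the patch: the names combining_intervals, format_table and SAMPLE are made up for illustration, and the sample lines are hand-abbreviated to the first three fields of UnicodeData.txt. Like the Perl script, it closes a range only when it sees a non-combining character, so unassigned gaps between two Me/Mn code points are absorbed into one interval, and a range still open at end of input would not be emitted.

```python
def combining_intervals(lines, max_codepoint=0xFFFF):
    """Return (first, last) pairs covering runs of Me/Mn code points.

    Mirrors the control flow of the Perl script: code points above
    max_codepoint are skipped, but still recorded as "previous".
    """
    intervals = []
    range_start = None
    prev_codepoint = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # small robustness addition, not in the Perl script
        fields = line.split(";")
        codepoint = int(fields[0], 16)
        if codepoint <= max_codepoint:
            if fields[2] in ("Me", "Mn"):
                # combining character: remember where the run started
                if range_start is None:
                    range_start = codepoint
            elif range_start is not None:
                # run ended at the previous line: emit it
                intervals.append((range_start, prev_codepoint))
                range_start = None
        prev_codepoint = codepoint
    return intervals


def format_table(intervals):
    """Render the intervals in the same shape as the generated header body."""
    out = ["static const struct mbinterval combining[] = {"]
    out += ["\t{0x%04X, 0x%04X}," % pair for pair in intervals]
    out.append("};")
    return "\n".join(out) + "\n"


# Hand-abbreviated sample (code point; name; general category only).
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu",
    "0300;COMBINING GRAVE ACCENT;Mn",
    "0301;COMBINING ACUTE ACCENT;Mn",
    "0370;GREEK CAPITAL LETTER HETA;Lu",
    "0488;COMBINING CYRILLIC HUNDRED THOUSANDS SIGN;Me",
    "0489;COMBINING CYRILLIC MILLIONS SIGN;Me",
    "048A;CYRILLIC CAPITAL LETTER SHORT I WITH TAIL;Lu",
    "1D167;MUSICAL SYMBOL COMBINING TREMOLO-1;Mn",
]
```

With this abbreviated sample, format_table(combining_intervals(SAMPLE)) produces two rows, {0x0300, 0x0301} and {0x0488, 0x0489}; the entry above U+FFFF is ignored. Run against the real UnicodeData.txt, the Perl script produces the full table shown in unicode_combining_table.h.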

John Naylor

john.naylor@2ndquadrant.com

about 6 years ago

In reply to: Peter Eisentraut (#3)

Re: Add support for automatically updating Unicode derived files

On Thu, Dec 26, 2019 at 12:39 PM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

> On 2019-12-19 23:48, John Naylor wrote:
>> I would print out the full boilerplate like for other generated headers.
>
> Hmm, you are probably comparing with
> src/common/unicode/generate-unicode_norm_table.pl, but other file
> generating scripts around the tree print out a small header in the style
> that I have. I'd rather adjust the output of
> generate-unicode_norm_table.pl to match those. (It's also not quite
> correct to make copyright claims about automatically generated output.)

Hmm, the scripts I'm most familiar with have full headers. Your point
about copyright makes sense, and using smaller file headers would aid
readability of the scripts, but I also see how others may feel
differently.

v2 looks good to me, marked ready for committer.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

about 6 years ago

In reply to: John Naylor (#4)

Re: Add support for automatically updating Unicode derived files

On 2020-01-03 15:13, John Naylor wrote:

> On Thu, Dec 26, 2019 at 12:39 PM Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> On 2019-12-19 23:48, John Naylor wrote:
>>> I would print out the full boilerplate like for other generated headers.
>>
>> Hmm, you are probably comparing with
>> src/common/unicode/generate-unicode_norm_table.pl, but other file
>> generating scripts around the tree print out a small header in the style
>> that I have. I'd rather adjust the output of
>> generate-unicode_norm_table.pl to match those. (It's also not quite
>> correct to make copyright claims about automatically generated output.)
>
> Hmm, the scripts I'm most familiar with have full headers. Your point
> about copyright makes sense, and using smaller file headers would aid
> readability of the scripts, but I also see how others may feel
> differently.
>
> v2 looks good to me, marked ready for committer.

Committed, thanks.

I have added a little tweak so that it works also without --with-python,
to avoid gratuitous annoyances.


Tom Lane

tgl@sss.pgh.pa.us

almost 6 years ago

In reply to: Peter Eisentraut (#5)

Re: Add support for automatically updating Unicode derived files

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

> Committed, thanks.

This patch is making src/tools/pginclude/headerscheck unhappy:

./src/include/common/unicode_combining_table.h:3: error: array type has incomplete element type

I guess that header needs another #include, or else you need to
move some declarations around.

regards, tom lane

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 6 years ago

In reply to: Tom Lane (#6)

Re: Add support for automatically updating Unicode derived files

On 2020-01-15 01:37, Tom Lane wrote:

> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Committed, thanks.
>
> This patch is making src/tools/pginclude/headerscheck unhappy:
>
> ./src/include/common/unicode_combining_table.h:3: error: array type has incomplete element type
>
> I guess that header needs another #include, or else you need to
> move some declarations around.

Hmm, this file is only meant to be included inside one particular
function. Making it standalone includable would seem to be unnecessary.
What should we do?
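
For context on "included inside one particular function": the generated header only supplies the array of intervals and relies on struct mbinterval already being declared at the point of inclusion, which is why it cannot be compiled on its own. The table is then consulted by a range lookup (the diff context names mbbisearch() for this). A minimal Python sketch of such a lookup follows, assuming a plain binary search over sorted, non-overlapping intervals; the function name in_intervals is hypothetical, and COMBINING holds just the first three rows of the generated table.

```python
def in_intervals(ucs, table):
    """Return True if ucs falls inside one of the (first, last) pairs.

    table must be sorted and non-overlapping, as the generated
    combining table is.
    """
    if not table or ucs < table[0][0] or ucs > table[-1][1]:
        return False
    lo, hi = 0, len(table) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last = table[mid]
        if ucs > last:
            lo = mid + 1
        elif ucs < first:
            hi = mid - 1
        else:
            return True
    return False


# First three rows of unicode_combining_table.h, for illustration only.
COMBINING = [(0x0300, 0x036F), (0x0483, 0x0489), (0x0591, 0x05BD)]
```

For example, in_intervals(0x0301, COMBINING) is true (a combining accent), while in_intervals(0x0041, COMBINING) is false.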


Tom Lane

tgl@sss.pgh.pa.us

almost 6 years ago

In reply to: Peter Eisentraut (#7)

Re: Add support for automatically updating Unicode derived files

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

> On 2020-01-15 01:37, Tom Lane wrote:
>> This patch is making src/tools/pginclude/headerscheck unhappy:
>> ./src/include/common/unicode_combining_table.h:3: error: array type has incomplete element type
>
> Hmm, this file is only meant to be included inside one particular
> function. Making it standalone includable would seem to be unnecessary.
> What should we do?

Well, we could make it a documented exception in headerscheck and
cpluspluscheck.

regards, tom lane

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 6 years ago

In reply to: Tom Lane (#8)

Re: Add support for automatically updating Unicode derived files

On 2020-01-20 16:43, Tom Lane wrote:

> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 2020-01-15 01:37, Tom Lane wrote:
>>> This patch is making src/tools/pginclude/headerscheck unhappy:
>>> ./src/include/common/unicode_combining_table.h:3: error: array type has incomplete element type
>>
>> Hmm, this file is only meant to be included inside one particular
>> function. Making it standalone includable would seem to be unnecessary.
>> What should we do?
>
> Well, we could make it a documented exception in headerscheck and
> cpluspluscheck.

OK, done.


#10

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

over 5 years ago

In reply to: Peter Eisentraut (#5)

Re: Add support for automatically updating Unicode derived files

I have committed the first Unicode data update using this new "make
update-unicode" facility.

CLDR is released regularly every 6 months, so around this time every
year would be the appropriate time to pull in the latest updates in
preparation for our own release.
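
To see what changing the two version settings pulls in, the URL patterns from the patch can be written out as below. This is an illustrative sketch only: the function names are hypothetical, the defaults are the values set in src/Makefile.global.in by the patch, and the "release-NN" form of the CLDR tag is an assumption based on the release-34 URL that the patch removes from the script comments.

```python
UNICODE_VERSION = "12.1.0"  # value set in src/Makefile.global.in by the patch
CLDR_VERSION = "34"         # likewise


def ucd_url(filename, version=UNICODE_VERSION):
    """URL of a Unicode Character Database file for a given release."""
    return "https://www.unicode.org/Public/%s/ucd/%s" % (version, filename)


def latin_ascii_url(cldr_version=CLDR_VERSION):
    """URL of Latin-ASCII.xml for a given CLDR release."""
    tag = "release-%s" % cldr_version  # assumed tag naming
    return ("https://raw.githubusercontent.com/unicode-org/cldr/"
            "%s/common/transforms/Latin-ASCII.xml" % tag)
```

So ucd_url("UnicodeData.txt") gives https://www.unicode.org/Public/12.1.0/ucd/UnicodeData.txt, and bumping the version argument changes only the release component of the path.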
