improvements in Unicode tables generation code

Started by Peter Eisentraut over 4 years ago, 8 messages

peter.eisentraut@enterprisedb.com

over 4 years ago

5 attachment(s)

I have accumulated a few patches to improve the output of the scripts in
src/backend/utils/mb/Unicode/ to be less non-standard-looking and fix a
few other minor things in that area.

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly for
parallel make. The script writes all output files in one go, but the
rule as written would call the command once for each output file in
parallel.

v1-0002-Make-UCS_to_most.pl-process-encodings-in-sorted-o.patch

This mainly just helps eyeball the output while debugging the previous
patch.

v1-0003-Remove-some-whitespace-in-generated-C-output.patch

Improve a small formatting issue in the output.

v1-0004-Simplify-code-generation-code.patch

This simplifies the code a bit, which helps with the next patch.

v1-0005-Fix-indentation-in-generated-output.patch

This changes the indentation in the output from two spaces to a tab.

I haven't included the actual output changes in the last patch, because
they would be huge, but the idea should be clear.

All together, these make the output look closer to how pgindent would
make it.

Attachments:

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch (text/plain)

From 3dce99c8e57aec91db85965b6cef947484c00a5e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 22 Jun 2021 09:06:28 +0200
Subject: [PATCH v1 1/5] Make Unicode makefile more parallel-safe

---
 src/backend/utils/mb/Unicode/Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile
index ed6fc07e08..4969b8c385 100644
--- a/src/backend/utils/mb/Unicode/Makefile
+++ b/src/backend/utils/mb/Unicode/Makefile
@@ -72,7 +72,9 @@ GENERICTEXTS = $(ISO8859TEXTS) $(WINTEXTS) \
 
 all: $(MAPS)
 
-$(GENERICMAPS): UCS_to_most.pl $(GENERICTEXTS)
+$(wordlist 2, $(words $(GENERICMAPS)), $(GENERICMAPS)): $(firstword $(GENERICMAPS)) ;
+
+$(firstword $(GENERICMAPS)): UCS_to_most.pl $(GENERICTEXTS)
 	$(PERL) -I $(srcdir) $<
 
 johab_to_utf8.map utf8_to_johab.map: UCS_to_JOHAB.pl JOHAB.TXT
-- 
2.32.0

v1-0002-Make-UCS_to_most.pl-process-encodings-in-sorted-o.patch (text/plain)

From 291f0acfd5331ab2f29710b156c40b0dad703ca2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 22 Jun 2021 09:06:28 +0200
Subject: [PATCH v1 2/5] Make UCS_to_most.pl process encodings in sorted order

This just makes the progress output easier to follow.
---
 src/backend/utils/mb/Unicode/UCS_to_most.pl | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/mb/Unicode/UCS_to_most.pl b/src/backend/utils/mb/Unicode/UCS_to_most.pl
index 4f974388d7..6b699b376d 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_most.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_most.pl
@@ -54,7 +54,8 @@
 # make maps for all encodings if not specified
 my @charsets = (scalar(@ARGV) > 0) ? @ARGV : sort keys(%filename);
 
-foreach my $charset (@charsets)
+# the sort is just so that the output is easier to eyeball
+foreach my $charset (sort @charsets)
 {
 	my $mapping = &read_source($filename{$charset});
 
-- 
2.32.0

v1-0003-Remove-some-whitespace-in-generated-C-output.patch (text/plain)

From 242440c79a10aab92e1293e1441dc963fd26e2ce Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 22 Jun 2021 09:06:28 +0200
Subject: [PATCH v1 3/5] Remove some whitespace in generated C output

It doesn't match the normal coding style.
---
 src/backend/utils/mb/Unicode/convutils.pm               | 4 ++--
 src/backend/utils/mb/Unicode/euc_jis_2004_to_utf8.map   | 2 +-
 src/backend/utils/mb/Unicode/shift_jis_2004_to_utf8.map | 2 +-
 src/backend/utils/mb/Unicode/utf8_to_euc_jis_2004.map   | 2 +-
 src/backend/utils/mb/Unicode/utf8_to_shift_jis_2004.map | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/mb/Unicode/convutils.pm b/src/backend/utils/mb/Unicode/convutils.pm
index 5ad38514be..8369e91b2d 100644
--- a/src/backend/utils/mb/Unicode/convutils.pm
+++ b/src/backend/utils/mb/Unicode/convutils.pm
@@ -173,7 +173,7 @@ sub print_from_utf8_combined_map
 
 	printf $out "\n/* Combined character map */\n";
 	printf $out
-	  "static const pg_utf_to_local_combined ULmap${charset}_combined[ %d ] = {",
+	  "static const pg_utf_to_local_combined ULmap${charset}_combined[%d] = {\n",
 	  scalar(@$table);
 	my $first = 1;
 	foreach my $i (sort { $a->{utf8} <=> $b->{utf8} } @$table)
@@ -208,7 +208,7 @@ sub print_to_utf8_combined_map
 
 	printf $out "\n/* Combined character map */\n";
 	printf $out
-	  "static const pg_local_to_utf_combined LUmap${charset}_combined[ %d ] = {",
+	  "static const pg_local_to_utf_combined LUmap${charset}_combined[%d] = {\n",
 	  scalar(@$table);
 
 	my $first = 1;
diff --git a/src/backend/utils/mb/Unicode/euc_jis_2004_to_utf8.map b/src/backend/utils/mb/Unicode/euc_jis_2004_to_utf8.map
index d2da4a383b..3a8fc9d26f 100644
--- a/src/backend/utils/mb/Unicode/euc_jis_2004_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/euc_jis_2004_to_utf8.map
@@ -3414,7 +3414,7 @@ static const uint32 euc_jis_2004_to_unicode_tree_table[11727] =
 };
 
 /* Combined character map */
-static const pg_local_to_utf_combined LUmapEUC_JIS_2004_combined[ 25 ] = {
+static const pg_local_to_utf_combined LUmapEUC_JIS_2004_combined[25] = {
   {0xa4f7, 0x00e3818b, 0x00e3829a},	/* U+304B+309A	 	[2000] */
   {0xa4f8, 0x00e3818d, 0x00e3829a},	/* U+304D+309A	 	[2000] */
   {0xa4f9, 0x00e3818f, 0x00e3829a},	/* U+304F+309A	 	[2000] */
diff --git a/src/backend/utils/mb/Unicode/shift_jis_2004_to_utf8.map b/src/backend/utils/mb/Unicode/shift_jis_2004_to_utf8.map
index e591a1135b..3c107cbb7b 100644
--- a/src/backend/utils/mb/Unicode/shift_jis_2004_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/shift_jis_2004_to_utf8.map
@@ -3205,7 +3205,7 @@ static const uint32 shift_jis_2004_to_unicode_tree_table[11716] =
 };
 
 /* Combined character map */
-static const pg_local_to_utf_combined LUmapSHIFT_JIS_2004_combined[ 25 ] = {
+static const pg_local_to_utf_combined LUmapSHIFT_JIS_2004_combined[25] = {
   {0x82f5, 0x00e3818b, 0x00e3829a},	/* U+304B+309A	 	[2000] */
   {0x82f6, 0x00e3818d, 0x00e3829a},	/* U+304D+309A	 	[2000] */
   {0x82f7, 0x00e3818f, 0x00e3829a},	/* U+304F+309A	 	[2000] */
diff --git a/src/backend/utils/mb/Unicode/utf8_to_euc_jis_2004.map b/src/backend/utils/mb/Unicode/utf8_to_euc_jis_2004.map
index fa90f3958f..0d47463805 100644
--- a/src/backend/utils/mb/Unicode/utf8_to_euc_jis_2004.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jis_2004.map
@@ -12538,7 +12538,7 @@ static const uint32 euc_jis_2004_from_unicode_tree_table[39163] =
 };
 
 /* Combined character map */
-static const pg_utf_to_local_combined ULmapEUC_JIS_2004_combined[ 25 ] = {
+static const pg_utf_to_local_combined ULmapEUC_JIS_2004_combined[25] = {
   {0x0000c3a6, 0x0000cc80, 0xabc4},	/* U+00E6+0300	 	[2000] */
   {0x0000c994, 0x0000cc80, 0xabc8},	/* U+0254+0300	 	[2000] */
   {0x0000c994, 0x0000cc81, 0xabc9},	/* U+0254+0301	 	[2000] */
diff --git a/src/backend/utils/mb/Unicode/utf8_to_shift_jis_2004.map b/src/backend/utils/mb/Unicode/utf8_to_shift_jis_2004.map
index b756b5f157..202ebb25c1 100644
--- a/src/backend/utils/mb/Unicode/utf8_to_shift_jis_2004.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_shift_jis_2004.map
@@ -7656,7 +7656,7 @@ static const uint16 shift_jis_2004_from_unicode_tree_table[39196] =
 };
 
 /* Combined character map */
-static const pg_utf_to_local_combined ULmapSHIFT_JIS_2004_combined[ 25 ] = {
+static const pg_utf_to_local_combined ULmapSHIFT_JIS_2004_combined[25] = {
   {0x0000c3a6, 0x0000cc80, 0x8663},	/* U+00E6+0300	 	[2000] */
   {0x0000c994, 0x0000cc80, 0x8667},	/* U+0254+0300	 	[2000] */
   {0x0000c994, 0x0000cc81, 0x8668},	/* U+0254+0301	 	[2000] */
-- 
2.32.0

v1-0004-Simplify-code-generation-code.patch (text/plain)

From 9dd3a3be9cb08fd05db26acbbe4800f4bf82e5ab Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 22 Jun 2021 09:06:28 +0200
Subject: [PATCH v1 4/5] Simplify code generation code

convutils.pm spent a fair amount of effort to avoid printing out a
trailing comma in an array initializer, which isn't actually
necessary.  We can simplify that code.  This also makes the generated
code look better indented.
---
 src/backend/utils/mb/Unicode/convutils.pm     | 37 +++++++------------
 .../utils/mb/Unicode/euc_jis_2004_to_utf8.map |  2 +-
 .../mb/Unicode/shift_jis_2004_to_utf8.map     |  2 +-
 .../utils/mb/Unicode/utf8_to_euc_jis_2004.map |  2 +-
 .../mb/Unicode/utf8_to_shift_jis_2004.map     |  2 +-
 5 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/src/backend/utils/mb/Unicode/convutils.pm b/src/backend/utils/mb/Unicode/convutils.pm
index 8369e91b2d..9a230e6dfe 100644
--- a/src/backend/utils/mb/Unicode/convutils.pm
+++ b/src/backend/utils/mb/Unicode/convutils.pm
@@ -169,34 +169,29 @@ sub print_from_utf8_combined_map
 {
 	my ($out, $charset, $table, $verbose) = @_;
 
-	my $last_comment = "";
-
 	printf $out "\n/* Combined character map */\n";
 	printf $out
 	  "static const pg_utf_to_local_combined ULmap${charset}_combined[%d] = {\n",
 	  scalar(@$table);
-	my $first = 1;
 	foreach my $i (sort { $a->{utf8} <=> $b->{utf8} } @$table)
 	{
-		print($out ",") if (!$first);
-		$first = 0;
-		print $out "\t/* $last_comment */"
-		  if ($verbose && $last_comment ne "");
+		my $comment;
 
-		printf $out "\n  {0x%08x, 0x%08x, 0x%04x}",
+		printf $out "  {0x%08x, 0x%08x, 0x%04x},",
 		  $i->{utf8}, $i->{utf8_second}, $i->{code};
 		if ($verbose >= 2)
 		{
-			$last_comment =
+			$comment =
 			  sprintf("%s:%d %s", $i->{f}, $i->{l}, $i->{comment});
 		}
 		elsif ($verbose >= 1)
 		{
-			$last_comment = $i->{comment};
+			$comment = $i->{comment};
 		}
+		print $out "\t/* $comment */" if $comment;
+		print $out "\n";
 	}
-	print $out "\t/* $last_comment */" if ($verbose && $last_comment ne "");
-	print $out "\n};\n";
+	print $out "};\n";
 	return;
 }
 
@@ -204,8 +199,6 @@ sub print_to_utf8_combined_map
 {
 	my ($out, $charset, $table, $verbose) = @_;
 
-	my $last_comment = "";
-
 	printf $out "\n/* Combined character map */\n";
 	printf $out
 	  "static const pg_local_to_utf_combined LUmap${charset}_combined[%d] = {\n",
@@ -214,26 +207,24 @@ sub print_to_utf8_combined_map
 	my $first = 1;
 	foreach my $i (sort { $a->{code} <=> $b->{code} } @$table)
 	{
-		print($out ",") if (!$first);
-		$first = 0;
-		print $out "\t/* $last_comment */"
-		  if ($verbose && $last_comment ne "");
+		my $comment;
 
-		printf $out "\n  {0x%04x, 0x%08x, 0x%08x}",
+		printf $out "  {0x%04x, 0x%08x, 0x%08x},",
 		  $i->{code}, $i->{utf8}, $i->{utf8_second};
 
 		if ($verbose >= 2)
 		{
-			$last_comment =
+			$comment =
 			  sprintf("%s:%d %s", $i->{f}, $i->{l}, $i->{comment});
 		}
 		elsif ($verbose >= 1)
 		{
-			$last_comment = $i->{comment};
+			$comment = $i->{comment};
 		}
+		print $out "\t/* $comment */" if $comment;
+		print $out "\n";
 	}
-	print $out "\t/* $last_comment */" if ($verbose && $last_comment ne "");
-	print $out "\n};\n";
+	print $out "};\n";
 	return;
 }
 
diff --git a/src/backend/utils/mb/Unicode/euc_jis_2004_to_utf8.map b/src/backend/utils/mb/Unicode/euc_jis_2004_to_utf8.map
index 3a8fc9d26f..7096fbb263 100644
--- a/src/backend/utils/mb/Unicode/euc_jis_2004_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/euc_jis_2004_to_utf8.map
@@ -3439,5 +3439,5 @@ static const pg_local_to_utf_combined LUmapEUC_JIS_2004_combined[25] = {
   {0xabce, 0x0000c99a, 0x0000cc80},	/* U+025A+0300	 	[2000] */
   {0xabcf, 0x0000c99a, 0x0000cc81},	/* U+025A+0301	 	[2000] */
   {0xabe5, 0x0000cba9, 0x0000cba5},	/* U+02E9+02E5	 	[2000] */
-  {0xabe6, 0x0000cba5, 0x0000cba9}	/* U+02E5+02E9	 	[2000] */
+  {0xabe6, 0x0000cba5, 0x0000cba9},	/* U+02E5+02E9	 	[2000] */
 };
diff --git a/src/backend/utils/mb/Unicode/shift_jis_2004_to_utf8.map b/src/backend/utils/mb/Unicode/shift_jis_2004_to_utf8.map
index 3c107cbb7b..cd0bd7a452 100644
--- a/src/backend/utils/mb/Unicode/shift_jis_2004_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/shift_jis_2004_to_utf8.map
@@ -3230,5 +3230,5 @@ static const pg_local_to_utf_combined LUmapSHIFT_JIS_2004_combined[25] = {
   {0x866d, 0x0000c99a, 0x0000cc80},	/* U+025A+0300	 	[2000] */
   {0x866e, 0x0000c99a, 0x0000cc81},	/* U+025A+0301	 	[2000] */
   {0x8685, 0x0000cba9, 0x0000cba5},	/* U+02E9+02E5	 	[2000] */
-  {0x8686, 0x0000cba5, 0x0000cba9}	/* U+02E5+02E9	 	[2000] */
+  {0x8686, 0x0000cba5, 0x0000cba9},	/* U+02E5+02E9	 	[2000] */
 };
diff --git a/src/backend/utils/mb/Unicode/utf8_to_euc_jis_2004.map b/src/backend/utils/mb/Unicode/utf8_to_euc_jis_2004.map
index 0d47463805..3de9d6360d 100644
--- a/src/backend/utils/mb/Unicode/utf8_to_euc_jis_2004.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jis_2004.map
@@ -12563,5 +12563,5 @@ static const pg_utf_to_local_combined ULmapEUC_JIS_2004_combined[25] = {
   {0x00e382bb, 0x00e3829a, 0xa5fc},	/* U+30BB+309A	 	[2000] */
   {0x00e38384, 0x00e3829a, 0xa5fd},	/* U+30C4+309A	 	[2000] */
   {0x00e38388, 0x00e3829a, 0xa5fe},	/* U+30C8+309A	 	[2000] */
-  {0x00e387b7, 0x00e3829a, 0xa6f8}	/* U+31F7+309A	 	[2000] */
+  {0x00e387b7, 0x00e3829a, 0xa6f8},	/* U+31F7+309A	 	[2000] */
 };
diff --git a/src/backend/utils/mb/Unicode/utf8_to_shift_jis_2004.map b/src/backend/utils/mb/Unicode/utf8_to_shift_jis_2004.map
index 202ebb25c1..924ccc114e 100644
--- a/src/backend/utils/mb/Unicode/utf8_to_shift_jis_2004.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_shift_jis_2004.map
@@ -7681,5 +7681,5 @@ static const pg_utf_to_local_combined ULmapSHIFT_JIS_2004_combined[25] = {
   {0x00e382bb, 0x00e3829a, 0x839c},	/* U+30BB+309A	 	[2000] */
   {0x00e38384, 0x00e3829a, 0x839d},	/* U+30C4+309A	 	[2000] */
   {0x00e38388, 0x00e3829a, 0x839e},	/* U+30C8+309A	 	[2000] */
-  {0x00e387b7, 0x00e3829a, 0x83f6}	/* U+31F7+309A	 	[2000] */
+  {0x00e387b7, 0x00e3829a, 0x83f6},	/* U+31F7+309A	 	[2000] */
 };
-- 
2.32.0

v1-0005-Fix-indentation-in-generated-output.patch (text/plain)

From 9ed1c960fe9ea5ac37fc88b4e6088bc3f2387cb4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 22 Jun 2021 09:06:28 +0200
Subject: [PATCH v1 5/5] Fix indentation in generated output

---
 src/backend/utils/mb/Unicode/convutils.pm | 66 +++++++++++------------
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/src/backend/utils/mb/Unicode/convutils.pm b/src/backend/utils/mb/Unicode/convutils.pm
index 9a230e6dfe..4182910f94 100644
--- a/src/backend/utils/mb/Unicode/convutils.pm
+++ b/src/backend/utils/mb/Unicode/convutils.pm
@@ -177,7 +177,7 @@ sub print_from_utf8_combined_map
 	{
 		my $comment;
 
-		printf $out "  {0x%08x, 0x%08x, 0x%04x},",
+		printf $out "\t{0x%08x, 0x%08x, 0x%04x},",
 		  $i->{utf8}, $i->{utf8_second}, $i->{code};
 		if ($verbose >= 2)
 		{
@@ -209,7 +209,7 @@ sub print_to_utf8_combined_map
 	{
 		my $comment;
 
-		printf $out "  {0x%04x, 0x%08x, 0x%08x},",
+		printf $out "\t{0x%04x, 0x%08x, 0x%08x},",
 		  $i->{code}, $i->{utf8}, $i->{utf8_second};
 
 		if ($verbose >= 2)
@@ -540,46 +540,46 @@ sub print_radix_table
 	printf $out "{\n";
 	if ($datatype eq "uint16")
 	{
-		print $out "  ${tblname}_table,\n";
-		print $out "  NULL, /* 32-bit table not used */\n";
+		print $out "\t${tblname}_table,\n";
+		print $out "\tNULL, /* 32-bit table not used */\n";
 	}
 	if ($datatype eq "uint32")
 	{
-		print $out "  NULL, /* 16-bit table not used */\n";
-		print $out "  ${tblname}_table,\n";
+		print $out "\tNULL, /* 16-bit table not used */\n";
+		print $out "\t${tblname}_table,\n";
 	}
 	printf $out "\n";
-	printf $out "  0x%04x, /* offset of table for 1-byte inputs */\n",
+	printf $out "\t0x%04x, /* offset of table for 1-byte inputs */\n",
 	  $b1root;
-	printf $out "  0x%02x, /* b1_lower */\n", $b1_lower;
-	printf $out "  0x%02x, /* b1_upper */\n", $b1_upper;
+	printf $out "\t0x%02x, /* b1_lower */\n", $b1_lower;
+	printf $out "\t0x%02x, /* b1_upper */\n", $b1_upper;
 	printf $out "\n";
-	printf $out "  0x%04x, /* offset of table for 2-byte inputs */\n",
+	printf $out "\t0x%04x, /* offset of table for 2-byte inputs */\n",
 	  $b2root;
-	printf $out "  0x%02x, /* b2_1_lower */\n", $b2_1_lower;
-	printf $out "  0x%02x, /* b2_1_upper */\n", $b2_1_upper;
-	printf $out "  0x%02x, /* b2_2_lower */\n", $b2_2_lower;
-	printf $out "  0x%02x, /* b2_2_upper */\n", $b2_2_upper;
+	printf $out "\t0x%02x, /* b2_1_lower */\n", $b2_1_lower;
+	printf $out "\t0x%02x, /* b2_1_upper */\n", $b2_1_upper;
+	printf $out "\t0x%02x, /* b2_2_lower */\n", $b2_2_lower;
+	printf $out "\t0x%02x, /* b2_2_upper */\n", $b2_2_upper;
 	printf $out "\n";
-	printf $out "  0x%04x, /* offset of table for 3-byte inputs */\n",
+	printf $out "\t0x%04x, /* offset of table for 3-byte inputs */\n",
 	  $b3root;
-	printf $out "  0x%02x, /* b3_1_lower */\n", $b3_1_lower;
-	printf $out "  0x%02x, /* b3_1_upper */\n", $b3_1_upper;
-	printf $out "  0x%02x, /* b3_2_lower */\n", $b3_2_lower;
-	printf $out "  0x%02x, /* b3_2_upper */\n", $b3_2_upper;
-	printf $out "  0x%02x, /* b3_3_lower */\n", $b3_3_lower;
-	printf $out "  0x%02x, /* b3_3_upper */\n", $b3_3_upper;
+	printf $out "\t0x%02x, /* b3_1_lower */\n", $b3_1_lower;
+	printf $out "\t0x%02x, /* b3_1_upper */\n", $b3_1_upper;
+	printf $out "\t0x%02x, /* b3_2_lower */\n", $b3_2_lower;
+	printf $out "\t0x%02x, /* b3_2_upper */\n", $b3_2_upper;
+	printf $out "\t0x%02x, /* b3_3_lower */\n", $b3_3_lower;
+	printf $out "\t0x%02x, /* b3_3_upper */\n", $b3_3_upper;
 	printf $out "\n";
-	printf $out "  0x%04x, /* offset of table for 3-byte inputs */\n",
+	printf $out "\t0x%04x, /* offset of table for 3-byte inputs */\n",
 	  $b4root;
-	printf $out "  0x%02x, /* b4_1_lower */\n", $b4_1_lower;
-	printf $out "  0x%02x, /* b4_1_upper */\n", $b4_1_upper;
-	printf $out "  0x%02x, /* b4_2_lower */\n", $b4_2_lower;
-	printf $out "  0x%02x, /* b4_2_upper */\n", $b4_2_upper;
-	printf $out "  0x%02x, /* b4_3_lower */\n", $b4_3_lower;
-	printf $out "  0x%02x, /* b4_3_upper */\n", $b4_3_upper;
-	printf $out "  0x%02x, /* b4_4_lower */\n", $b4_4_lower;
-	printf $out "  0x%02x  /* b4_4_upper */\n", $b4_4_upper;
+	printf $out "\t0x%02x, /* b4_1_lower */\n", $b4_1_lower;
+	printf $out "\t0x%02x, /* b4_1_upper */\n", $b4_1_upper;
+	printf $out "\t0x%02x, /* b4_2_lower */\n", $b4_2_lower;
+	printf $out "\t0x%02x, /* b4_2_upper */\n", $b4_2_upper;
+	printf $out "\t0x%02x, /* b4_3_lower */\n", $b4_3_lower;
+	printf $out "\t0x%02x, /* b4_3_upper */\n", $b4_3_upper;
+	printf $out "\t0x%02x, /* b4_4_lower */\n", $b4_4_lower;
+	printf $out "\t0x%02x  /* b4_4_upper */\n", $b4_4_upper;
 	print $out "};\n";
 	print $out "\n";
 	print $out "static const $datatype ${tblname}_table[$tblsize] =\n";
@@ -589,7 +589,7 @@ sub print_radix_table
 	foreach my $seg (@segments)
 	{
 		printf $out "\n";
-		printf $out "  /*** %s - offset 0x%05x ***/\n", $seg->{header}, $off;
+		printf $out "\t/*** %s - offset 0x%05x ***/\n", $seg->{header}, $off;
 		printf $out "\n";
 
 		for (my $i = $seg->{min_idx}; $i <= $seg->{max_idx};)
@@ -597,7 +597,7 @@ sub print_radix_table
 
 			# Print the next line's worth of values.
 			# XXX pad to begin at a nice boundary
-			printf $out "  /* %02x */ ", $i;
+			printf $out "\t/* %02x */", $i;
 			for (my $j = 0;
 				$j < $vals_per_line && $i <= $seg->{max_idx}; $j++)
 			{
@@ -617,7 +617,7 @@ sub print_radix_table
 		if ($seg->{overlaid_trail_zeros})
 		{
 			printf $out
-			  "    /* $seg->{overlaid_trail_zeros} trailing zero values shared with next segment */\n";
+			  "\t/* $seg->{overlaid_trail_zeros} trailing zero values shared with next segment */\n";
 		}
 	}
 
-- 
2.32.0

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 4 years ago

In reply to: Peter Eisentraut (#1)

Re: improvements in Unicode tables generation code

At Tue, 22 Jun 2021 09:20:16 +0200, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote in

I have accumulated a few patches to improve the output of the scripts
in src/backend/utils/mb/Unicode/ to be less non-standard-looking and
fix a few other minor things in that area.

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly
for parallel make. The script writes all output files in one go, but
the rule as written would call the command once for each output file
in parallel.

That behavior annoyed me too, but I had not found a way to stop it. The
patch looks like it works. (I haven't run it myself, though, for the
reason given at the end of this mail.)

v1-0002-Make-UCS_to_most.pl-process-encodings-in-sorted-o.patch

This mainly just helps eyeball the output while debugging the previous
patch.

v1-0003-Remove-some-whitespace-in-generated-C-output.patch

Improve a small formatting issue in the output.

These look just fine.

v1-0004-Simplify-code-generation-code.patch

This simplifies the code a bit, which helps with the next patch.

This simplifies the code in exchange for allowing a comma after the
last element of array literals. I'm fine with it as long as we allow
that style in the tree.
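For reference, the trade-off mentioned here is harmless as far as the compiler is concerned: standard C accepts a comma after the last element of an initializer list, and it does not change the element count. The following is a minimal sketch, not code from the tree. The struct `combined_entry` and its field names are hypothetical stand-ins for `pg_utf_to_local_combined`; the two rows are copied from the utf8_to_euc_jis_2004.map hunk in patch 0004.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pg_utf_to_local_combined; field names are assumed. */
typedef struct
{
	uint32_t	utf1;
	uint32_t	utf2;
	uint32_t	code;
} combined_entry;

/* Old style: the last element has no trailing comma. */
const combined_entry without_comma[] = {
	{0x00e38388, 0x00e3829a, 0xa5fe},
	{0x00e387b7, 0x00e3829a, 0xa6f8}
};

/* New style: every element, including the last, ends with a comma. */
const combined_entry with_comma[] = {
	{0x00e38388, 0x00e3829a, 0xa5fe},
	{0x00e387b7, 0x00e3829a, 0xa6f8},
};

/* Element counts; the trailing comma does not add an element. */
size_t
count_without_comma(void)
{
	return sizeof(without_comma) / sizeof(without_comma[0]);
}

size_t
count_with_comma(void)
{
	return sizeof(with_comma) / sizeof(with_comma[0]);
}

/* The last element is the same row in both spellings. */
uint32_t
last_code_with_comma(void)
{
	return with_comma[count_with_comma() - 1].code;
}
```

Both counting functions return 2, so the choice between the two spellings is purely one of style.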

v1-0005-Fix-indentation-in-generated-output.patch

This changes the indentation in the output from two spaces to a tab.

I haven't included the actual output changes in the last patch,
because they would be huge, but the idea should be clear.

All together, these make the output look closer to how pgindent would
make it.

I agree with the fix.

Mmm. (Although somewhat unrelated to this patch set:) I tried this, but
found that www.unicode.org has not been responding (for at least the
past several days). I'm not sure what is happening here.

wget -O 8859-2.TXT --no-use-server-timestamps https://www.unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT
--2021-06-22 17:09:34-- https://www.unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT
Resolving www.unicode.org (www.unicode.org)... 66.34.208.12
Connecting to www.unicode.org (www.unicode.org)|66.34.208.12|:443...

(timeouts)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Heikki Linnakangas

hlinnaka@iki.fi

over 4 years ago

In reply to: Peter Eisentraut (#1)

Re: improvements in Unicode tables generation code

On 22/06/2021 10:20, Peter Eisentraut wrote:

I have accumulated a few patches to improve the output of the scripts in
src/backend/utils/mb/Unicode/ to be less non-standard-looking and fix a
few other minor things in that area.

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly for
parallel make. The script writes all output files in one go, but the
rule as written would call the command once for each output file in
parallel.

This could use a comment. At a quick glance, I don't understand what all
the $(wordlist ...) magic does.

Perhaps we should change the script or Makefile so that it doesn't
create all the maps in one go?

v1-0002-Make-UCS_to_most.pl-process-encodings-in-sorted-o.patch

This mainly just helps eyeball the output while debugging the previous
patch.

v1-0003-Remove-some-whitespace-in-generated-C-output.patch

Improve a small formatting issue in the output.

I'm surprised the added \n in the perl code didn't result in extra
newlines in the outputs.

v1-0004-Simplify-code-generation-code.patch

This simplifies the code a bit, which helps with the next patch.

If we do that, let's add the trailing commas to the other arrays too,
not just the combined maps.

No objection, but how does this help the next patch?

If we want to avoid the stray commas (and I think they are a little
ugly, but that's a matter of taste), we could adopt the approach that
print_radix_table() uses to avoid the comma. That seems simpler than
what print_from_utf8_combined_map and print_to_utf8_combined_map are doing.
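The thread does not show the loop inside print_radix_table(); the only visible hint is that its last struct field (b4_4_upper) is printed without a comma. One way to get the same effect in a loop, assuming the element count is known before printing starts, is to emit the separator after every element except the last. The sketch below is in C rather than Perl, and `emit_values` is a hypothetical helper, not a function from convutils.pm.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write the values as a comma-separated initializer body.  The separator
 * is emitted after every element except the last, which is detected by
 * comparing the running index with the known total, so no "first" flag
 * or deferred state is needed.
 *
 * Assumes buf is large enough for the whole output (len > 0).
 */
int
emit_values(char *buf, size_t len, const unsigned *vals, size_t count)
{
	int			n = 0;

	buf[0] = '\0';
	for (size_t i = 0; i < count; i++)
	{
		n += snprintf(buf + n, len - n, " 0x%04x", vals[i]);
		if (i + 1 != count)
			n += snprintf(buf + n, len - n, ",");
	}
	return n;
}
```

Called with the three values 1, 2 and 3, this fills the buffer with ` 0x0001, 0x0002, 0x0003` and no comma after the final value.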

v1-0005-Fix-indentation-in-generated-output.patch

This changes the indentation in the output from two spaces to a tab.

I haven't included the actual output changes in the last patch, because
they would be huge, but the idea should be clear.

All together, these make the output look closer to how pgindent would
make it.

Thanks!

- Heikki

Kyotaro Horiguchi

horikyota.ntt@gmail.com

over 4 years ago

In reply to: Heikki Linnakangas (#3)

Re: improvements in Unicode tables generation code

At Tue, 22 Jun 2021 11:20:46 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in

On 22/06/2021 10:20, Peter Eisentraut wrote:

v1-0004-Simplify-code-generation-code.patch
This simplifies the code a bit, which helps with the next patch.

If we do that, let's add the trailing commas to the other arrays too,
not just the combined maps.

No objection, but how does this help the next patch?

If we want to avoid the stray commas (and I think they are a little
ugly, but that's a matter of taste), we could adopt the approach that
print_radix_table() uses to avoid the comma. That seems simpler than
what print_from_utf8_combined_map and print_to_utf8_combined_map are
doing.

+1 for adopting the same method as print_radix_table *if* we do want
to avoid the stray commas.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Peter Eisentraut

peter.eisentraut@enterprisedb.com

over 4 years ago

In reply to: Heikki Linnakangas (#3)

Re: improvements in Unicode tables generation code

On 22.06.21 10:20, Heikki Linnakangas wrote:

On 22/06/2021 10:20, Peter Eisentraut wrote:

I have accumulated a few patches to improve the output of the scripts in
src/backend/utils/mb/Unicode/ to be less non-standard-looking and fix a
few other minor things in that area.

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly for
parallel make. The script writes all output files in one go, but the
rule as written would call the command once for each output file in
parallel.

This could use a comment. At a quick glance, I don't understand what all
the $(wordlist ...) magic does.

Perhaps we should change the script or Makefile so that it doesn't
create all the maps in one go?

I agree, either comment it better or just write one file at a time.
I'll take another look at that.

v1-0003-Remove-some-whitespace-in-generated-C-output.patch

Improve a small formatting issue in the output.

I'm surprised the added \n in the perl code didn't result in extra
newlines in the outputs.

True, I'll have to check that again. I suspect that \n actually belongs
to patch 0004.
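The concern can be reproduced in isolation. Before patch 0004, each row was printed with a leading "\n" and the closing brace followed one more "\n", so a header that itself ends in "\n" would leave an empty line before the first row. The sketch below is a simplified C re-creation of that print sequence, not the Perl code itself; `emit_old_style`, `has_blank_line`, the `entry` type name in the output, and the two-row table are all made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a two-row table the way the pre-0004 loop did: a comma before
 * every row but the first, each row printed with a leading "\n", and a
 * final "\n" before the closing brace.  header_newline selects whether
 * the header line ends in "\n" (as in patch 0003) or not (as before it).
 *
 * Assumes buf is large enough for the whole output.
 */
int
emit_old_style(char *buf, size_t len, int header_newline)
{
	int			n = 0;

	n += snprintf(buf + n, len - n, "static const entry map[%d] = {%s",
				  2, header_newline ? "\n" : "");
	for (int i = 0; i < 2; i++)
	{
		if (i > 0)
			n += snprintf(buf + n, len - n, ",");
		n += snprintf(buf + n, len - n, "\n  {0x%04x}", 0xa4f7 + i);
	}
	n += snprintf(buf + n, len - n, "\n};\n");
	return n;
}

/* Return 1 if the text contains an empty line (two consecutive newlines). */
int
has_blank_line(const char *text)
{
	return strstr(text, "\n\n") != NULL;
}
```

With `header_newline` set to 0 the output has no empty line; with 1 it has one right after the opening brace. That matches the suspicion that the added "\n" only fits together with the row format introduced in patch 0004.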

v1-0004-Simplify-code-generation-code.patch

This simplifies the code a bit, which helps with the next patch.

If we do that, let's add the trailing commas to the other arrays too,
not just the combined maps.

No objection, but how does this help the next patch?

Mainly it just moves things around so that each print normally starts at
the beginning of a line and concludes with a \n.
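To illustrate that shape: after patches 0004 and 0005, one loop iteration produces exactly one complete output line, namely a tab, the values, a comma, an optional comment, and the newline. The C sketch below mirrors that structure under simplifying assumptions; `emit_row` is a hypothetical helper with only two value fields, not a translation of the actual Perl function.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Emit one table row as a complete line: tab indent, the values, a
 * trailing comma, an optional comment, and the terminating newline.
 * Nothing is carried over to the next iteration, so changing the
 * indentation only touches the start of the format string.
 *
 * Assumes buf is large enough for the whole line.
 */
int
emit_row(char *buf, size_t len, unsigned code, unsigned utf8,
		 const char *comment)
{
	int			n;

	n = snprintf(buf, len, "\t{0x%04x, 0x%08x},", code, utf8);
	if (comment != NULL && comment[0] != '\0')
		n += snprintf(buf + n, len - n, "\t/* %s */", comment);
	n += snprintf(buf + n, len - n, "\n");
	return n;
}
```

Because every row is self-contained, the old bookkeeping (a "first" flag and a comment deferred until the next row) is no longer needed.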

Peter Eisentraut

peter.eisentraut@enterprisedb.com

over 4 years ago

In reply to: Peter Eisentraut (#5)

1 attachment(s)

Re: improvements in Unicode tables generation code

On 23.06.21 10:55, Peter Eisentraut wrote:

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly for
parallel make. The script writes all output files in one go, but the
rule as written would call the command once for each output file in
parallel.

This could use a comment. At a quick glance, I don't understand what
all the $(wordlist ...) magic does.

Perhaps we should change the script or Makefile so that it doesn't
create all the maps in one go?

I agree, either comment it better or just write one file at a time. I'll
take another look at that.

Here is a patch that does it one file (pair) at a time. The other rules
besides UCS_to_most.pl actually had the same problem, since they produce
two output files, so running in parallel called each script twice. In
this patch, all of that is heavily refactored and works correctly now.
Note that UCS_to_most.pl already accepted a command-line argument to
specify which encoding to work with.

Attachments:

0001-Make-Unicode-makefile-parallel-safe.patch (text/plain)

From 974720b0b6c92f42506ae37d8e88368ba279b973 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 20 Jul 2021 13:47:13 +0200
Subject: [PATCH] Make Unicode makefile parallel-safe

Fix the rules so that each rule is parallel safe, using the same
trickery that we use elsewhere in the tree for rules that produce more
than one output file.  Refactor the whole makefile so that there is
less repetition.
---
 src/backend/utils/mb/Unicode/Makefile | 134 +++++++++-----------------
 1 file changed, 45 insertions(+), 89 deletions(-)

diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile
index ed6fc07e08..cdcde4fbbd 100644
--- a/src/backend/utils/mb/Unicode/Makefile
+++ b/src/backend/utils/mb/Unicode/Makefile
@@ -12,101 +12,57 @@ subdir = src/backend/utils/mb/Unicode
 top_builddir = ../../../../..
 include $(top_builddir)/src/Makefile.global
 
-ISO8859MAPS = iso8859_2_to_utf8.map utf8_to_iso8859_2.map \
-	iso8859_3_to_utf8.map utf8_to_iso8859_3.map \
-	iso8859_4_to_utf8.map utf8_to_iso8859_4.map \
-	iso8859_5_to_utf8.map utf8_to_iso8859_5.map \
-	iso8859_6_to_utf8.map utf8_to_iso8859_6.map \
-	iso8859_7_to_utf8.map utf8_to_iso8859_7.map \
-	iso8859_8_to_utf8.map utf8_to_iso8859_8.map \
-	iso8859_9_to_utf8.map utf8_to_iso8859_9.map \
-	iso8859_10_to_utf8.map utf8_to_iso8859_10.map \
-	iso8859_13_to_utf8.map utf8_to_iso8859_13.map \
-	iso8859_14_to_utf8.map utf8_to_iso8859_14.map \
-	iso8859_15_to_utf8.map utf8_to_iso8859_15.map \
-	iso8859_16_to_utf8.map utf8_to_iso8859_16.map
-
-WINMAPS = win866_to_utf8.map utf8_to_win866.map \
-	win874_to_utf8.map utf8_to_win874.map \
-	win1250_to_utf8.map utf8_to_win1250.map \
-	win1251_to_utf8.map utf8_to_win1251.map \
-	win1252_to_utf8.map utf8_to_win1252.map \
-	win1253_to_utf8.map utf8_to_win1253.map \
-	win1254_to_utf8.map utf8_to_win1254.map \
-	win1255_to_utf8.map utf8_to_win1255.map \
-	win1256_to_utf8.map utf8_to_win1256.map \
-	win1257_to_utf8.map utf8_to_win1257.map \
-	win1258_to_utf8.map utf8_to_win1258.map
-
-GENERICMAPS = $(ISO8859MAPS) $(WINMAPS) \
-	gbk_to_utf8.map utf8_to_gbk.map \
-	koi8r_to_utf8.map utf8_to_koi8r.map \
-	koi8u_to_utf8.map utf8_to_koi8u.map
-
-SPECIALMAPS = euc_cn_to_utf8.map utf8_to_euc_cn.map \
-	euc_jp_to_utf8.map utf8_to_euc_jp.map \
-	euc_kr_to_utf8.map utf8_to_euc_kr.map \
-	euc_tw_to_utf8.map utf8_to_euc_tw.map \
-	sjis_to_utf8.map utf8_to_sjis.map \
-	gb18030_to_utf8.map utf8_to_gb18030.map \
-	big5_to_utf8.map utf8_to_big5.map \
-	johab_to_utf8.map utf8_to_johab.map \
-	uhc_to_utf8.map utf8_to_uhc.map \
-	euc_jis_2004_to_utf8.map utf8_to_euc_jis_2004.map \
-	shift_jis_2004_to_utf8.map utf8_to_shift_jis_2004.map
-
-MAPS = $(GENERICMAPS) $(SPECIALMAPS)
-
-ISO8859TEXTS = 8859-2.TXT 8859-3.TXT 8859-4.TXT 8859-5.TXT \
-	8859-6.TXT 8859-7.TXT 8859-8.TXT 8859-9.TXT \
-	8859-10.TXT 8859-13.TXT 8859-14.TXT 8859-15.TXT \
-	8859-16.TXT
-
-WINTEXTS = CP866.TXT CP874.TXT CP936.TXT \
-	CP1250.TXT CP1251.TXT \
-	CP1252.TXT CP1253.TXT CP1254.TXT CP1255.TXT \
-	CP1256.TXT CP1257.TXT CP1258.TXT
-
-GENERICTEXTS = $(ISO8859TEXTS) $(WINTEXTS) \
-	KOI8-R.TXT KOI8-U.TXT
 
-all: $(MAPS)
-
-$(GENERICMAPS): UCS_to_most.pl $(GENERICTEXTS)
-	$(PERL) -I $(srcdir) $<
-
-johab_to_utf8.map utf8_to_johab.map: UCS_to_JOHAB.pl JOHAB.TXT
-	$(PERL) -I $(srcdir) $<
-
-uhc_to_utf8.map utf8_to_uhc.map: UCS_to_UHC.pl windows-949-2000.xml
-	$(PERL) -I $(srcdir) $<
-
-euc_jp_to_utf8.map utf8_to_euc_jp.map: UCS_to_EUC_JP.pl CP932.TXT JIS0212.TXT
-	$(PERL) -I $(srcdir) $<
+# Define a rule to create to map files from downloaded text input
+# files using a script.  Arguments:
+#
+# 1: encoding name used in output files (lower case)
+# 2: script name
+# 3: input text files
+# 4: argument to pass to script (optional)
+#
+# We also collect all the input and output files in variables to
+# define the build and clean rules below.
+#
+# Note that while each script call produces two output files, to be
+# parallel-make safe we need to split this into two rules.  (See for
+# example gram.y for more explanation.)
+#
+define map_rule
+MAPS += $(1)_to_utf8.map utf8_to_$(1).map
+ALL_TEXTS += $(3)
 
-euc_cn_to_utf8.map utf8_to_euc_cn.map: UCS_to_EUC_CN.pl gb-18030-2000.xml
-	$(PERL) -I $(srcdir) $<
+$(1)_to_utf8.map: $(2) $(3)
+	$(PERL) -I $$(srcdir) $$< $(4)
 
-euc_kr_to_utf8.map utf8_to_euc_kr.map: UCS_to_EUC_KR.pl KSX1001.TXT
-	$(PERL) -I $(srcdir) $<
+utf8_to_$(1).map: $(1)_to_utf8.map
+	@touch $$@
+endef
 
-euc_tw_to_utf8.map utf8_to_euc_tw.map: UCS_to_EUC_TW.pl CNS11643.TXT
-	$(PERL) -I $(srcdir) $<
+$(foreach n,2 3 4 5 6 7 8 9 10 13 14 15 16,$(eval $(call map_rule,iso8859_$(n),UCS_to_most.pl,8859-$(n).TXT,ISO8859_$(n))))
 
-sjis_to_utf8.map utf8_to_sjis.map: UCS_to_SJIS.pl CP932.TXT
-	$(PERL) -I $(srcdir) $<
+$(foreach n,866 874 1250 1251 1252 1253 1254 1255 1256 1257 1258,$(eval $(call map_rule,win$(n),UCS_to_most.pl,CP$(n).TXT,WIN$(n))))
 
-gb18030_to_utf8.map utf8_to_gb18030.map: UCS_to_GB18030.pl gb-18030-2000.xml
-	$(PERL) -I $(srcdir) $<
+$(eval $(call map_rule,koi8r,UCS_to_most.pl,KOI8-R.TXT,KOI8R))
+$(eval $(call map_rule,koi8u,UCS_to_most.pl,KOI8-U.TXT,KOI8U))
+$(eval $(call map_rule,gbk,UCS_to_most.pl,CP936.TXT,GBK))
 
-big5_to_utf8.map utf8_to_big5.map: UCS_to_BIG5.pl BIG5.TXT CP950.TXT
-	$(PERL) -I $(srcdir) $<
+$(eval $(call map_rule,johab,UCS_to_JOHAB.pl,JOHAB.TXT))
+$(eval $(call map_rule,uhc,UCS_to_UHC.pl,windows-949-2000.xml))
+$(eval $(call map_rule,euc_jp,UCS_to_EUC_JP.pl,CP932.TXT JIS0212.TXT))
+$(eval $(call map_rule,euc_cn,UCS_to_EUC_CN.pl,gb-18030-2000.xml))
+$(eval $(call map_rule,euc_kr,UCS_to_EUC_KR.pl,KSX1001.TXT))
+$(eval $(call map_rule,euc_tw,UCS_to_EUC_TW.pl,CNS11643.TXT))
+$(eval $(call map_rule,sjis,UCS_to_SJIS.pl,CP932.TXT))
+$(eval $(call map_rule,gb18030,UCS_to_GB18030.pl,gb-18030-2000.xml))
+$(eval $(call map_rule,big5,UCS_to_BIG5.pl,CP950.TXT BIG5.TXT CP950.TXT))
+$(eval $(call map_rule,euc_jis_2004,UCS_to_EUC_JIS_2004.pl,euc-jis-2004-std.txt))
+$(eval $(call map_rule,shift_jis_2004,UCS_to_SHIFT_JIS_2004.pl,sjis-0213-2004-std.txt))
 
-euc_jis_2004_to_utf8.map utf8_to_euc_jis_2004.map: UCS_to_EUC_JIS_2004.pl euc-jis-2004-std.txt
-	$(PERL) -I $(srcdir) $<
+# remove duplicates
+TEXTS = $(sort ALL_TEXTS)
 
-shift_jis_2004_to_utf8.map utf8_to_shift_jis_2004.map: UCS_to_SHIFT_JIS_2004.pl sjis-0213-2004-std.txt
-	$(PERL) -I $(srcdir) $<
+all: $(MAPS)
 
 distclean: clean
 	rm -f $(TEXTS)
@@ -136,11 +92,11 @@ JOHAB.TXT KSX1001.TXT:
 KOI8-R.TXT KOI8-U.TXT:
 	$(DOWNLOAD) https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/$(@F)
 
-$(ISO8859TEXTS):
+$(filter 8859-%.TXT,$(TEXTS)):
 	$(DOWNLOAD) https://www.unicode.org/Public/MAPPINGS/ISO8859/$(@F)
 
-$(filter-out CP8%,$(WINTEXTS)) CP932.TXT CP950.TXT:
+$(filter CP9%.TXT CP12%.TXT,$(TEXTS)):
 	$(DOWNLOAD) https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/$(@F)
 
-$(filter CP8%,$(WINTEXTS)):
+$(filter CP8%.TXT,$(TEXTS)):
 	$(DOWNLOAD) https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/$(@F)
-- 
2.32.0

Peter Eisentraut

peter.eisentraut@enterprisedb.com

over 4 years ago

In reply to: Peter Eisentraut (#6)

1 attachment(s)

Re: improvements in Unicode tables generation code

On 20.07.21 13:57, Peter Eisentraut wrote:

Perhaps we should change the script or Makefile so that it doesn't
create all the maps in one go?

I agree, either comment it better or just write one file at a time.
I'll take another look at that.

Here is a patch that does it one file (pair) at a time. The other rules
besides UCS_to_most.pl actually had the same problem, since they produce
two output files, so running in parallel called each script twice. In
this patch, all of that is heavily refactored and works correctly now.
Note that UCS_to_most.pl already accepted a command-line argument to
specify which encoding to work with.
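The two-output problem described above can be sketched in a standalone makefile. This is only an illustration, not the patch itself: gen.pl, input.txt, a.map and b.map are hypothetical stand-ins, and gen.pl is assumed to write both map files in a single run.

```makefile
# Hypothetical names throughout: gen.pl is assumed to write a.map and b.map.
PERL ?= perl

all: a.map b.map

# Problematic form (left commented out): GNU make reads a rule with two
# targets as two independent rules sharing one recipe, so a parallel build
# that needs both files may start gen.pl twice at the same time.
#
# a.map b.map: gen.pl input.txt
#	$(PERL) $<

# Parallel-safe form: only the first output carries the real recipe.
a.map: gen.pl input.txt
	$(PERL) $<

# The second output depends on the first and only refreshes its timestamp.
b.map: a.map
	@touch $@
```

With this layout, make -j2 should start gen.pl once, because b.map cannot be considered until a.map is finished. One known weakness of the pattern: if b.map is deleted while a.map is still up to date, the touch recipe recreates b.map as an empty file instead of rerunning the script.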

Here is an updated patch that fixes a thinko which made the previous patch
not actually work.
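The message does not say what the thinko was. Comparing the two attachments in this thread, one visible difference is the TEXTS assignment: $(sort ALL_TEXTS) in the earlier patch versus $(sort $(ALL_TEXTS)) in v2. Assuming that is the fix meant here, a minimal sketch of the difference (the show target and the WRONG and RIGHT variables are made up for illustration):

```makefile
# Sample input with a duplicate, as happens when two encodings share a file.
ALL_TEXTS = CP932.TXT JIS0212.TXT CP932.TXT

# Without a variable reference, sort receives the literal word "ALL_TEXTS".
WRONG = $(sort ALL_TEXTS)

# With the reference, sort receives the file names, orders them and
# drops the duplicate.
RIGHT = $(sort $(ALL_TEXTS))

# Hypothetical helper target that prints both results.
show:
	@echo 'WRONG = $(WRONG)'
	@echo 'RIGHT = $(RIGHT)'
```

Running make show prints WRONG = ALL_TEXTS and RIGHT = CP932.TXT JIS0212.TXT. With the first form, TEXTS would hold no real file names, so the $(filter ...) download rules and the distclean rule in the patch would have nothing to match.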

Attachments:

v2-0001-Make-Unicode-makefile-parallel-safe.patch (text/plain; charset=UTF-8)

From d243f21f4f4d38db08f427556256c87681a2c831 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Tue, 28 Sep 2021 10:17:31 +0200
Subject: [PATCH v2] Make Unicode makefile parallel-safe

Fix the rules so that each rule is parallel safe, using the same
trickery that we use elsewhere in the tree for rules that produce more
than one output file.  Refactor the whole makefile so that there is
less repetition.

Discussion: https://www.postgresql.org/message-id/18e34084-aab1-1b4c-edd1-c4f9fb04f714%40enterprisedb.com
---
 src/backend/utils/mb/Unicode/Makefile | 134 +++++++++-----------------
 1 file changed, 45 insertions(+), 89 deletions(-)

diff --git a/src/backend/utils/mb/Unicode/Makefile b/src/backend/utils/mb/Unicode/Makefile
index ed6fc07e08..6e54b8f291 100644
--- a/src/backend/utils/mb/Unicode/Makefile
+++ b/src/backend/utils/mb/Unicode/Makefile
@@ -12,101 +12,57 @@ subdir = src/backend/utils/mb/Unicode
 top_builddir = ../../../../..
 include $(top_builddir)/src/Makefile.global
 
-ISO8859MAPS = iso8859_2_to_utf8.map utf8_to_iso8859_2.map \
-	iso8859_3_to_utf8.map utf8_to_iso8859_3.map \
-	iso8859_4_to_utf8.map utf8_to_iso8859_4.map \
-	iso8859_5_to_utf8.map utf8_to_iso8859_5.map \
-	iso8859_6_to_utf8.map utf8_to_iso8859_6.map \
-	iso8859_7_to_utf8.map utf8_to_iso8859_7.map \
-	iso8859_8_to_utf8.map utf8_to_iso8859_8.map \
-	iso8859_9_to_utf8.map utf8_to_iso8859_9.map \
-	iso8859_10_to_utf8.map utf8_to_iso8859_10.map \
-	iso8859_13_to_utf8.map utf8_to_iso8859_13.map \
-	iso8859_14_to_utf8.map utf8_to_iso8859_14.map \
-	iso8859_15_to_utf8.map utf8_to_iso8859_15.map \
-	iso8859_16_to_utf8.map utf8_to_iso8859_16.map
-
-WINMAPS = win866_to_utf8.map utf8_to_win866.map \
-	win874_to_utf8.map utf8_to_win874.map \
-	win1250_to_utf8.map utf8_to_win1250.map \
-	win1251_to_utf8.map utf8_to_win1251.map \
-	win1252_to_utf8.map utf8_to_win1252.map \
-	win1253_to_utf8.map utf8_to_win1253.map \
-	win1254_to_utf8.map utf8_to_win1254.map \
-	win1255_to_utf8.map utf8_to_win1255.map \
-	win1256_to_utf8.map utf8_to_win1256.map \
-	win1257_to_utf8.map utf8_to_win1257.map \
-	win1258_to_utf8.map utf8_to_win1258.map
-
-GENERICMAPS = $(ISO8859MAPS) $(WINMAPS) \
-	gbk_to_utf8.map utf8_to_gbk.map \
-	koi8r_to_utf8.map utf8_to_koi8r.map \
-	koi8u_to_utf8.map utf8_to_koi8u.map
-
-SPECIALMAPS = euc_cn_to_utf8.map utf8_to_euc_cn.map \
-	euc_jp_to_utf8.map utf8_to_euc_jp.map \
-	euc_kr_to_utf8.map utf8_to_euc_kr.map \
-	euc_tw_to_utf8.map utf8_to_euc_tw.map \
-	sjis_to_utf8.map utf8_to_sjis.map \
-	gb18030_to_utf8.map utf8_to_gb18030.map \
-	big5_to_utf8.map utf8_to_big5.map \
-	johab_to_utf8.map utf8_to_johab.map \
-	uhc_to_utf8.map utf8_to_uhc.map \
-	euc_jis_2004_to_utf8.map utf8_to_euc_jis_2004.map \
-	shift_jis_2004_to_utf8.map utf8_to_shift_jis_2004.map
-
-MAPS = $(GENERICMAPS) $(SPECIALMAPS)
-
-ISO8859TEXTS = 8859-2.TXT 8859-3.TXT 8859-4.TXT 8859-5.TXT \
-	8859-6.TXT 8859-7.TXT 8859-8.TXT 8859-9.TXT \
-	8859-10.TXT 8859-13.TXT 8859-14.TXT 8859-15.TXT \
-	8859-16.TXT
-
-WINTEXTS = CP866.TXT CP874.TXT CP936.TXT \
-	CP1250.TXT CP1251.TXT \
-	CP1252.TXT CP1253.TXT CP1254.TXT CP1255.TXT \
-	CP1256.TXT CP1257.TXT CP1258.TXT
-
-GENERICTEXTS = $(ISO8859TEXTS) $(WINTEXTS) \
-	KOI8-R.TXT KOI8-U.TXT
 
-all: $(MAPS)
-
-$(GENERICMAPS): UCS_to_most.pl $(GENERICTEXTS)
-	$(PERL) -I $(srcdir) $<
-
-johab_to_utf8.map utf8_to_johab.map: UCS_to_JOHAB.pl JOHAB.TXT
-	$(PERL) -I $(srcdir) $<
-
-uhc_to_utf8.map utf8_to_uhc.map: UCS_to_UHC.pl windows-949-2000.xml
-	$(PERL) -I $(srcdir) $<
-
-euc_jp_to_utf8.map utf8_to_euc_jp.map: UCS_to_EUC_JP.pl CP932.TXT JIS0212.TXT
-	$(PERL) -I $(srcdir) $<
+# Define a rule to create the map files from downloaded text input
+# files using a script.  Arguments:
+#
+# 1: encoding name used in output files (lower case)
+# 2: script name
+# 3: input text files
+# 4: argument to pass to script (optional)
+#
+# We also collect all the input and output files in variables to
+# define the build and clean rules below.
+#
+# Note that while each script call produces two output files, to be
+# parallel-make safe we need to split this into two rules.  (See for
+# example gram.y for more explanation.)
+#
+define map_rule
+MAPS += $(1)_to_utf8.map utf8_to_$(1).map
+ALL_TEXTS += $(3)
 
-euc_cn_to_utf8.map utf8_to_euc_cn.map: UCS_to_EUC_CN.pl gb-18030-2000.xml
-	$(PERL) -I $(srcdir) $<
+$(1)_to_utf8.map: $(2) $(3)
+	$(PERL) -I $$(srcdir) $$< $(4)
 
-euc_kr_to_utf8.map utf8_to_euc_kr.map: UCS_to_EUC_KR.pl KSX1001.TXT
-	$(PERL) -I $(srcdir) $<
+utf8_to_$(1).map: $(1)_to_utf8.map
+	@touch $$@
+endef
 
-euc_tw_to_utf8.map utf8_to_euc_tw.map: UCS_to_EUC_TW.pl CNS11643.TXT
-	$(PERL) -I $(srcdir) $<
+$(foreach n,2 3 4 5 6 7 8 9 10 13 14 15 16,$(eval $(call map_rule,iso8859_$(n),UCS_to_most.pl,8859-$(n).TXT,ISO8859_$(n))))
 
-sjis_to_utf8.map utf8_to_sjis.map: UCS_to_SJIS.pl CP932.TXT
-	$(PERL) -I $(srcdir) $<
+$(foreach n,866 874 1250 1251 1252 1253 1254 1255 1256 1257 1258,$(eval $(call map_rule,win$(n),UCS_to_most.pl,CP$(n).TXT,WIN$(n))))
 
-gb18030_to_utf8.map utf8_to_gb18030.map: UCS_to_GB18030.pl gb-18030-2000.xml
-	$(PERL) -I $(srcdir) $<
+$(eval $(call map_rule,koi8r,UCS_to_most.pl,KOI8-R.TXT,KOI8R))
+$(eval $(call map_rule,koi8u,UCS_to_most.pl,KOI8-U.TXT,KOI8U))
+$(eval $(call map_rule,gbk,UCS_to_most.pl,CP936.TXT,GBK))
 
-big5_to_utf8.map utf8_to_big5.map: UCS_to_BIG5.pl BIG5.TXT CP950.TXT
-	$(PERL) -I $(srcdir) $<
+$(eval $(call map_rule,johab,UCS_to_JOHAB.pl,JOHAB.TXT))
+$(eval $(call map_rule,uhc,UCS_to_UHC.pl,windows-949-2000.xml))
+$(eval $(call map_rule,euc_jp,UCS_to_EUC_JP.pl,CP932.TXT JIS0212.TXT))
+$(eval $(call map_rule,euc_cn,UCS_to_EUC_CN.pl,gb-18030-2000.xml))
+$(eval $(call map_rule,euc_kr,UCS_to_EUC_KR.pl,KSX1001.TXT))
+$(eval $(call map_rule,euc_tw,UCS_to_EUC_TW.pl,CNS11643.TXT))
+$(eval $(call map_rule,sjis,UCS_to_SJIS.pl,CP932.TXT))
+$(eval $(call map_rule,gb18030,UCS_to_GB18030.pl,gb-18030-2000.xml))
+$(eval $(call map_rule,big5,UCS_to_BIG5.pl,CP950.TXT BIG5.TXT CP950.TXT))
+$(eval $(call map_rule,euc_jis_2004,UCS_to_EUC_JIS_2004.pl,euc-jis-2004-std.txt))
+$(eval $(call map_rule,shift_jis_2004,UCS_to_SHIFT_JIS_2004.pl,sjis-0213-2004-std.txt))
 
-euc_jis_2004_to_utf8.map utf8_to_euc_jis_2004.map: UCS_to_EUC_JIS_2004.pl euc-jis-2004-std.txt
-	$(PERL) -I $(srcdir) $<
+# remove duplicates
+TEXTS = $(sort $(ALL_TEXTS))
 
-shift_jis_2004_to_utf8.map utf8_to_shift_jis_2004.map: UCS_to_SHIFT_JIS_2004.pl sjis-0213-2004-std.txt
-	$(PERL) -I $(srcdir) $<
+all: $(MAPS)
 
 distclean: clean
 	rm -f $(TEXTS)
@@ -136,11 +92,11 @@ JOHAB.TXT KSX1001.TXT:
 KOI8-R.TXT KOI8-U.TXT:
 	$(DOWNLOAD) https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/$(@F)
 
-$(ISO8859TEXTS):
+$(filter 8859-%.TXT,$(TEXTS)):
 	$(DOWNLOAD) https://www.unicode.org/Public/MAPPINGS/ISO8859/$(@F)
 
-$(filter-out CP8%,$(WINTEXTS)) CP932.TXT CP950.TXT:
+$(filter CP9%.TXT CP12%.TXT,$(TEXTS)):
 	$(DOWNLOAD) https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/$(@F)
 
-$(filter CP8%,$(WINTEXTS)):
+$(filter CP8%.TXT,$(TEXTS)):
 	$(DOWNLOAD) https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/$(@F)
-- 
2.33.0

Peter Eisentraut

peter.eisentraut@enterprisedb.com

over 4 years ago

In reply to: Peter Eisentraut (#7)

Re: improvements in Unicode tables generation code

On 28.09.21 10:25, Peter Eisentraut wrote:

On 20.07.21 13:57, Peter Eisentraut wrote:

Perhaps we should change the script or Makefile so that it doesn't
create all the maps in one go?

I agree, either comment it better or just write one file at a time.
I'll take another look at that.

Here is a patch that does it one file (pair) at a time. The other
rules besides UCS_to_most.pl actually had the same problem, since they
produce two output files, so running in parallel called each script
twice. In this patch, all of that is heavily refactored and works
correctly now. Note that UCS_to_most.pl already accepted a
command-line argument to specify which encoding to work with.

Here is an updated patch that fixes a thinko which made the previous patch
not actually work.

I have committed this one and closed the CF entry.