improvements in Unicode tables generation code

Started by Peter Eisentrautalmost 5 years ago8 messageshackers
Jump to latest
#1Peter Eisentraut
peter_e@gmx.net

I have accumulated a few patches to improve the output of the scripts in
src/backend/utils/mb/Unicode/ to be less non-standard-looking and fix a
few other minor things in that area.

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly for
parallel make. The script writes all output files in one go, but the
rule as written would call the command once for each output file in
parallel.

v1-0002-Make-UCS_to_most.pl-process-encodings-in-sorted-o.patch

This mainly just helps eyeball the output while debugging the previous
patch.

v1-0003-Remove-some-whitespace-in-generated-C-output.patch

Improve a small formatting issue in the output.

v1-0004-Simplify-code-generation-code.patch

This simplifies the code a bit, which helps with the next patch.

v1-0005-Fix-indentation-in-generated-output.patch

This changes the indentation in the output from two spaces to a tab.

I haven't included the actual output changes in the last patch, because
they would be huge, but the idea should be clear.

All together, these make the output look closer to how pgindent would
make it.

Attachments:

v1-0001-Make-Unicode-makefile-more-parallel-safe.patchtext/plain; charset=UTF-8; name=v1-0001-Make-Unicode-makefile-more-parallel-safe.patch; x-mac-creator=0; x-mac-type=0Download+3-2
v1-0002-Make-UCS_to_most.pl-process-encodings-in-sorted-o.patchtext/plain; charset=UTF-8; name=v1-0002-Make-UCS_to_most.pl-process-encodings-in-sorted-o.patch; x-mac-creator=0; x-mac-type=0Download+2-2
v1-0003-Remove-some-whitespace-in-generated-C-output.patchtext/plain; charset=UTF-8; name=v1-0003-Remove-some-whitespace-in-generated-C-output.patch; x-mac-creator=0; x-mac-type=0Download+6-7
v1-0004-Simplify-code-generation-code.patchtext/plain; charset=UTF-8; name=v1-0004-Simplify-code-generation-code.patch; x-mac-creator=0; x-mac-type=0Download+18-28
v1-0005-Fix-indentation-in-generated-output.patchtext/plain; charset=UTF-8; name=v1-0005-Fix-indentation-in-generated-output.patch; x-mac-creator=0; x-mac-type=0Download+33-34
#2Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Peter Eisentraut (#1)
Re: improvements in Unicode tables generation code

At Tue, 22 Jun 2021 09:20:16 +0200, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote in

I have accumulated a few patches to improve the output of the scripts
in src/backend/utils/mb/Unicode/ to be less non-standard-looking and
fix a few other minor things in that area.

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly
for parallel make. The script writes all output files in one go, but
the rule as written would call the command once for each output file
in parallel.

I was annoyed by that behavior but haven't found how to stop that. It
looks to work. (But I haven't run it for me for the reason at the end
of this mail.)

v1-0002-Make-UCS_to_most.pl-process-encodings-in-sorted-o.patch

This mainly just helps eyeball the output while debugging the previous
patch.

v1-0003-Remove-some-whitespace-in-generated-C-output.patch

Improve a small formatting issue in the output.

These look just fine.

v1-0004-Simplify-code-generation-code.patch

This simplifies the code a bit, which helps with the next patch.

This simplifies the code in exchange of allowing a comma after the
last element of array literals. I'm fine with it as long as we allow
that style in the tree.

v1-0005-Fix-indentation-in-generated-output.patch

This changes the indentation in the output from two spaces to a tab.

I haven't included the actual output changes in the last patch,
because they would be huge, but the idea should be clear.

All together, these make the output look closer to how pgindent would
make it.

I agree to the fix.

Mmm. (although, somewhat unrelated to this patch set) I tried this but
I found that www.unicode.org doesn't respond (for at least these
several days). I'm not sure what is happening here.

wget -O 8859-2.TXT --no-use-server-timestamps https://www.unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT
--2021-06-22 17:09:34-- https://www.unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT
Resolving www.unicode.org (www.unicode.org)... 66.34.208.12
Connecting to www.unicode.org (www.unicode.org)|66.34.208.12|:443...

(timeouts)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Peter Eisentraut (#1)
Re: improvements in Unicode tables generation code

On 22/06/2021 10:20, Peter Eisentraut wrote:

I have accumulated a few patches to improve the output of the scripts in
src/backend/utils/mb/Unicode/ to be less non-standard-looking and fix a
few other minor things in that area.

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly for
parallel make. The script writes all output files in one go, but the
rule as written would call the command once for each output file in
parallel.

This could use a comment. At a quick glance, I don't understand what all
the $(wordlist ...) magic does.

Perhaps we should change the script or Makefile so that it doesn't
create all the maps in one go?

v1-0002-Make-UCS_to_most.pl-process-encodings-in-sorted-o.patch

This mainly just helps eyeball the output while debugging the previous
patch.

+1

v1-0003-Remove-some-whitespace-in-generated-C-output.patch

Improve a small formatting issue in the output.

I'm surprised the added \n in the perl code didn't result in extra
newlines in the outputs.

v1-0004-Simplify-code-generation-code.patch

This simplifies the code a bit, which helps with the next patch.

If we do that, let's add the trailing commas to the other arrays too,
not just the combined maps.

No objection, but how does this help the next patch?

If we want to avoid the stray commas (and I think they are a little
ugly, but that's a matter of taste), we could adopt the approach that
print_radix_table() uses to avoid the comma. That seems simpler than
what print_from_utf8_combined_map and print_to_utf8_combined_map are doing.

v1-0005-Fix-indentation-in-generated-output.patch

This changes the indentation in the output from two spaces to a tab.

I haven't included the actual output changes in the last patch, because
they would be huge, but the idea should be clear.

All together, these make the output look closer to how pgindent would
make it.

Thanks!

- Heikki

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: improvements in Unicode tables generation code

At Tue, 22 Jun 2021 11:20:46 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in

On 22/06/2021 10:20, Peter Eisentraut wrote:

v1-0004-Simplify-code-generation-code.patch
This simplifies the code a bit, which helps with the next patch.

If we do that, let's add the trailing commas to the other arrays too,
not just the combined maps.

No objection, but how does this help the next patch?

If we want to avoid the stray commas (and I think they are a little
ugly, but that's a matter of taste), we could adopt the approach that
print_radix_table() uses to avoid the comma. That seems simpler than
what print_from_utf8_combined_map and print_to_utf8_combined_map are
doing.

+1 for adopting the same method with print_radix_table *if* we do want
to avoid the stray commans.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Peter Eisentraut
peter_e@gmx.net
In reply to: Heikki Linnakangas (#3)
Re: improvements in Unicode tables generation code

On 22.06.21 10:20, Heikki Linnakangas wrote:

On 22/06/2021 10:20, Peter Eisentraut wrote:

I have accumulated a few patches to improve the output of the scripts in
src/backend/utils/mb/Unicode/ to be less non-standard-looking and fix a
few other minor things in that area.

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly for
parallel make.  The script writes all output files in one go, but the
rule as written would call the command once for each output file in
parallel.

This could use a comment. At a quick glance, I don't understand what all
the $(wordlist ...) magic does.

Perhaps we should change the script or Makefile so that it doesn't
create all the maps in one go?

I agree, either comment it better or just write one file at a time.
I'll take another look at that.

v1-0003-Remove-some-whitespace-in-generated-C-output.patch

Improve a small formatting issue in the output.

I'm surprised the added \n in the perl code didn't result in extra
newlines in the outputs.

True, I'll have to check that again. I suspect that \n actually belongs
to patch 0004.

v1-0004-Simplify-code-generation-code.patch

This simplifies the code a bit, which helps with the next patch.

If we do that, let's add the trailing commas to the other arrays too,
not just the combined maps.

No objection, but how does this help the next patch?

Mainly it just moves things around so that each print normally starts at
the beginning of a line and concludes with a \n.

#6Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#5)
Re: improvements in Unicode tables generation code

On 23.06.21 10:55, Peter Eisentraut wrote:

v1-0001-Make-Unicode-makefile-more-parallel-safe.patch

The makefile rule that calls UCS_to_most.pl was written incorrectly for
parallel make.  The script writes all output files in one go, but the
rule as written would call the command once for each output file in
parallel.

This could use a comment. At a quick glance, I don't understand what
all the $(wordlist ...) magic does.

Perhaps we should change the script or Makefile so that it doesn't
create all the maps in one go?

I agree, either comment it better or just write one file at a time. I'll
take another look at that.

Here is a patch that does it one file (pair) at a time. The other rules
besides UCS_to_most.pl actually had the same problem, since they produce
two output files, so running in parallel called each script twice. In
this patch, all of that is heavily refactored and works correctly now.
Note that UCS_to_most.pl already accepted a command-line argument to
specify which encoding to work with.

Attachments:

0001-Make-Unicode-makefile-parallel-safe.patchtext/plain; charset=UTF-8; name=0001-Make-Unicode-makefile-parallel-safe.patch; x-mac-creator=0; x-mac-type=0Download+45-90
#7Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#6)
Re: improvements in Unicode tables generation code

On 20.07.21 13:57, Peter Eisentraut wrote:

Perhaps we should change the script or Makefile so that it doesn't
create all the maps in one go?

I agree, either comment it better or just write one file at a time.
I'll take another look at that.

Here is a patch that does it one file (pair) at a time.  The other rules
besides UCS_to_most.pl actually had the same problem, since they produce
two output files, so running in parallel called each script twice.  In
this patch, all of that is heavily refactored and works correctly now.
Note that UCS_to_most.pl already accepted a command-line argument to
specify which encoding to work with.

Here is an updated patch with a thinko fix that made the previous patch
not actually work.

Attachments:

v2-0001-Make-Unicode-makefile-parallel-safe.patchtext/plain; charset=UTF-8; name=v2-0001-Make-Unicode-makefile-parallel-safe.patch; x-mac-creator=0; x-mac-type=0Download+45-90
#8Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#7)
Re: improvements in Unicode tables generation code

On 28.09.21 10:25, Peter Eisentraut wrote:

On 20.07.21 13:57, Peter Eisentraut wrote:

Perhaps we should change the script or Makefile so that it doesn't
create all the maps in one go?

I agree, either comment it better or just write one file at a time.
I'll take another look at that.

Here is a patch that does it one file (pair) at a time.  The other
rules besides UCS_to_most.pl actually had the same problem, since they
produce two output files, so running in parallel called each script
twice.  In this patch, all of that is heavily refactored and works
correctly now. Note that UCS_to_most.pl already accepted a
command-line argument to specify which encoding to work with.

Here is an updated patch with a thinko fix that made the previous patch
not actually work.

I have committed this one and closed the CF entry.