badly calculated width of emoji in psql

Started by Pavel Stehulealmost 5 years ago37 messages
#1Pavel Stehule
pavel.stehule@gmail.com

Hi

I am checked an query from
https://www.depesz.com/2021/04/01/waiting-for-postgresql-14-add-unistr-function/
article.

postgres=# SELECT U&'\+01F603';
┌──────────┐
│ ?column? │
╞══════════╡
│ 😃 │
└──────────┘
(1 row)

The result is not correct. Emoji has width 2 chars, but psql calculates
with just one char.

Regards

Pavel

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Pavel Stehule (#1)
Re: badly calculated width of emoji in psql

On Fri, 2021-04-02 at 10:45 +0200, Pavel Stehule wrote:

I am checked an query from https://www.depesz.com/2021/04/01/waiting-for-postgresql-14-add-unistr-function/ article.

postgres=# SELECT U&'\+01F603';
┌──────────┐
│ ?column? │
╞══════════╡
│ 😃 │
└──────────┘
(1 row)

The result is not correct. Emoji has width 2 chars, but psql calculates with just one char.

How about this:

diff --git a/src/common/wchar.c b/src/common/wchar.c
index 6e7d731e02..e2d0d9691c 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -673,7 +673,8 @@ ucs_wcwidth(pg_wchar ucs)
 		  (ucs >= 0xfe30 && ucs <= 0xfe6f) ||	/* CJK Compatibility Forms */
 		  (ucs >= 0xff00 && ucs <= 0xff5f) ||	/* Fullwidth Forms */
 		  (ucs >= 0xffe0 && ucs <= 0xffe6) ||
-		  (ucs >= 0x20000 && ucs <= 0x2ffff)));
+		  (ucs >= 0x20000 && ucs <= 0x2ffff) ||
+		  (ucs >= 0x1f300 && ucs <= 0x1faff)));	/* symbols and emojis */
 }

/*

This is guesswork based on the unicode entries that look like symbols.

Yours,
Laurenz Albe

#3Pavel Stehule
pavel.stehule@gmail.com
In reply to: Laurenz Albe (#2)
Re: badly calculated width of emoji in psql

pá 2. 4. 2021 v 11:37 odesílatel Laurenz Albe <laurenz.albe@cybertec.at>
napsal:

On Fri, 2021-04-02 at 10:45 +0200, Pavel Stehule wrote:

I am checked an query from

https://www.depesz.com/2021/04/01/waiting-for-postgresql-14-add-unistr-function/
article.

postgres=# SELECT U&'\+01F603';
┌──────────┐
│ ?column? │
╞══════════╡
│ 😃 │
└──────────┘
(1 row)

The result is not correct. Emoji has width 2 chars, but psql calculates

with just one char.

How about this:

diff --git a/src/common/wchar.c b/src/common/wchar.c
index 6e7d731e02..e2d0d9691c 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -673,7 +673,8 @@ ucs_wcwidth(pg_wchar ucs)
(ucs >= 0xfe30 && ucs <= 0xfe6f) ||   /* CJK
Compatibility Forms */
(ucs >= 0xff00 && ucs <= 0xff5f) ||   /* Fullwidth Forms
*/
(ucs >= 0xffe0 && ucs <= 0xffe6) ||
-                 (ucs >= 0x20000 && ucs <= 0x2ffff)));
+                 (ucs >= 0x20000 && ucs <= 0x2ffff) ||
+                 (ucs >= 0x1f300 && ucs <= 0x1faff))); /* symbols and
emojis */
}

/*

This is guesswork based on the unicode entries that look like symbols.

it helps

with this patch, the formatting is correct

Pavel

Show quoted text

Yours,
Laurenz Albe

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Pavel Stehule (#3)
Re: badly calculated width of emoji in psql

At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <pavel.stehule@gmail.com> wrote in

with this patch, the formatting is correct

I think the hardest point of this issue is that we don't have a
reasonable authoritative source that determines character width. And
that the presentation is heavily dependent on environment.

Unicode 9 and/or 10 defines the character properties "Emoji" and
"Emoji_Presentation", and tr51[1]http://www.unicode.org/reports/tr51/ says that

Emoji are generally presented with a square aspect ratio, which
presents a problem for flags.

...

Current practice is for emoji to have a square aspect ratio, deriving
from their origin in Japanese. For interoperability, it is recommended
that this practice be continued with current and future emoji. They
will typically have about the same vertical placement and advance
width as CJK ideographs. For example:

Ok, even putting aside flags, the first table in [2]https://unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt asserts that "#",
"*", "0-9" are emoji characters. But we and I think no-one never
present them in two-columns. And the table has many mysterious holes
I haven't looked into.

We could Emoji_Presentation=yes for the purpose, but for example,
U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
WITH VERTICAL BAR) does not for a reason uncertaion to me. It doesn't
look like other than some kind of mistake.

About environment, for example, U+23E9 is an emoji, and
Emoji_Presentation=yes, but it is shown in one column on my
xterm. (I'm not sure what font am I using..)

[1]: http://www.unicode.org/reports/tr51/
[2]: https://unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt

A possible compromise is that we treat all Emoji=yes characters
excluding ASCII characters as double-width and manually merge the
fragmented regions into reasonably larger chunks.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Pavel Stehule
pavel.stehule@gmail.com
In reply to: Kyotaro Horiguchi (#4)
Re: badly calculated width of emoji in psql

po 5. 4. 2021 v 7:07 odesílatel Kyotaro Horiguchi <horikyota.ntt@gmail.com>
napsal:

At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <pavel.stehule@gmail.com>
wrote in

with this patch, the formatting is correct

I think the hardest point of this issue is that we don't have a
reasonable authoritative source that determines character width. And
that the presentation is heavily dependent on environment.

Unicode 9 and/or 10 defines the character properties "Emoji" and
"Emoji_Presentation", and tr51[1] says that

Emoji are generally presented with a square aspect ratio, which
presents a problem for flags.

...

Current practice is for emoji to have a square aspect ratio, deriving
from their origin in Japanese. For interoperability, it is recommended
that this practice be continued with current and future emoji. They
will typically have about the same vertical placement and advance
width as CJK ideographs. For example:

Ok, even putting aside flags, the first table in [2] asserts that "#",
"*", "0-9" are emoji characters. But we and I think no-one never
present them in two-columns. And the table has many mysterious holes
I haven't looked into.

We could Emoji_Presentation=yes for the purpose, but for example,
U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
WITH VERTICAL BAR) does not for a reason uncertaion to me. It doesn't
look like other than some kind of mistake.

About environment, for example, U+23E9 is an emoji, and
Emoji_Presentation=yes, but it is shown in one column on my
xterm. (I'm not sure what font am I using..)

[1] http://www.unicode.org/reports/tr51/
[2] https://unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt

A possible compromise is that we treat all Emoji=yes characters
excluding ASCII characters as double-width and manually merge the
fragmented regions into reasonably larger chunks.

ok

It should be fixed in glibc,

https://sourceware.org/bugzilla/show_bug.cgi?id=20313

so we can check it

Regards

Pavel

Show quoted text

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6Jacob Champion
pchampion@vmware.com
In reply to: Kyotaro Horiguchi (#4)
Re: badly calculated width of emoji in psql

On Mon, 2021-04-05 at 14:07 +0900, Kyotaro Horiguchi wrote:

At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <pavel.stehule@gmail.com> wrote in

with this patch, the formatting is correct

I think the hardest point of this issue is that we don't have a
reasonable authoritative source that determines character width. And
that the presentation is heavily dependent on environment.

Unicode 9 and/or 10 defines the character properties "Emoji" and
"Emoji_Presentation", and tr51[1] says that

Emoji are generally presented with a square aspect ratio, which
presents a problem for flags.

...

Current practice is for emoji to have a square aspect ratio, deriving
from their origin in Japanese. For interoperability, it is recommended
that this practice be continued with current and future emoji. They
will typically have about the same vertical placement and advance
width as CJK ideographs. For example:

Ok, even putting aside flags, the first table in [2] asserts that "#",
"*", "0-9" are emoji characters. But we and I think no-one never
present them in two-columns. And the table has many mysterious holes
I haven't looked into.

I think that's why Emoji_Presentation is false for those characters --
they _could_ be presented as emoji if the UI should choose to do so, or
if an emoji presentation selector is used, but by default a text
presentation would be expected.

We could Emoji_Presentation=yes for the purpose, but for example,
U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
WITH VERTICAL BAR) does not for a reason uncertaion to me. It doesn't
look like other than some kind of mistake.

That is strange.

About environment, for example, U+23E9 is an emoji, and
Emoji_Presentation=yes, but it is shown in one column on my
xterm. (I'm not sure what font am I using..)

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

Furthermore, if the stream contains an emoji presentation selector
after a code point that would normally be text, shouldn't we change
that glyph to have an emoji "expected width"?

I'm wondering if the most correct solution would be to have the user
tell the client what width to use, using .psqlrc or something.

A possible compromise is that we treat all Emoji=yes characters
excluding ASCII characters as double-width and manually merge the
fragmented regions into reasonably larger chunks.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there's
roughly double the number of emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.

--

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

--Jacob

#7Pavel Stehule
pavel.stehule@gmail.com
In reply to: Jacob Champion (#6)
Re: badly calculated width of emoji in psql

st 7. 7. 2021 v 20:03 odesílatel Jacob Champion <pchampion@vmware.com>
napsal:

On Mon, 2021-04-05 at 14:07 +0900, Kyotaro Horiguchi wrote:

At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <

pavel.stehule@gmail.com> wrote in

with this patch, the formatting is correct

I think the hardest point of this issue is that we don't have a
reasonable authoritative source that determines character width. And
that the presentation is heavily dependent on environment.

Unicode 9 and/or 10 defines the character properties "Emoji" and
"Emoji_Presentation", and tr51[1] says that

Emoji are generally presented with a square aspect ratio, which
presents a problem for flags.

...

Current practice is for emoji to have a square aspect ratio, deriving
from their origin in Japanese. For interoperability, it is recommended
that this practice be continued with current and future emoji. They
will typically have about the same vertical placement and advance
width as CJK ideographs. For example:

Ok, even putting aside flags, the first table in [2] asserts that "#",
"*", "0-9" are emoji characters. But we and I think no-one never
present them in two-columns. And the table has many mysterious holes
I haven't looked into.

I think that's why Emoji_Presentation is false for those characters --
they _could_ be presented as emoji if the UI should choose to do so, or
if an emoji presentation selector is used, but by default a text
presentation would be expected.

We could Emoji_Presentation=yes for the purpose, but for example,
U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
WITH VERTICAL BAR) does not for a reason uncertaion to me. It doesn't
look like other than some kind of mistake.

That is strange.

About environment, for example, U+23E9 is an emoji, and
Emoji_Presentation=yes, but it is shown in one column on my
xterm. (I'm not sure what font am I using..)

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

Furthermore, if the stream contains an emoji presentation selector
after a code point that would normally be text, shouldn't we change
that glyph to have an emoji "expected width"?

I'm wondering if the most correct solution would be to have the user
tell the client what width to use, using .psqlrc or something.

Gnome terminal does it - VTE does it - there is option how to display chars
with not well specified width.

A possible compromise is that we treat all Emoji=yes characters
excluding ASCII characters as double-width and manually merge the
fragmented regions into reasonably larger chunks.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there's
roughly double the number of emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.

--

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

We can check how these chars are printed in most common terminals in modern
versions. I am afraid that it can be problematic to find a solution that
works everywhere, because some libraries on some platforms are pretty
obsolete.

Regards

Pavel

Show quoted text

--Jacob

#8Michael Paquier
michael@paquier.xyz
In reply to: Jacob Champion (#6)
Re: badly calculated width of emoji in psql

On Wed, Jul 07, 2021 at 06:03:34PM +0000, Jacob Champion wrote:

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

I'd say that we are doing our best in guessing what it should be,
then. One cannot predict how fonts are designed.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there's
roughly double the number of emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.

Hmm. Such things have a cost, and this one sounds costly with a
limited impact. What do we gain except a better visibility with psql?

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

Based on the following list, it seems to me that [u+1f300,u+0x1faff]
won't capture everything, like the country flags:
http://www.unicode.org/emoji/charts/full-emoji-list.html
--
Michael

#9Pavel Stehule
pavel.stehule@gmail.com
In reply to: Michael Paquier (#8)
Re: badly calculated width of emoji in psql

po 19. 7. 2021 v 9:46 odesílatel Michael Paquier <michael@paquier.xyz>
napsal:

On Wed, Jul 07, 2021 at 06:03:34PM +0000, Jacob Champion wrote:

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

I'd say that we are doing our best in guessing what it should be,
then. One cannot predict how fonts are designed.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there's
roughly double the number of emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.

Hmm. Such things have a cost, and this one sounds costly with a
limited impact. What do we gain except a better visibility with psql?

The benefit is correct displaying. I checked impact on server side, and
ucs_wcwidth is used just for calculation of error position. Any other usage
is only in psql.

Moreover, I checked unicode ranges, and I think so for common languages the
performance impact should be zero (because typically use ucs < 0x1100). The
possible (but very low) impact can be for some historic languages or
special symbols. It has not any impact for ranges that currently return
display width 2, because the new range is at the end of list.

I am not sure how wide usage of PQdsplen is outside psql, but I have no
reason to think so, so developers will prefer this function over built
functionality in any developing environment that supports unicode. So in
this case I have a strong opinion to prefer correctness of result against
current speed (note: I have an experience from pspg development, where this
operation is really on critical path, and I tried do some micro
optimization without strong effect - on very big unusual result (very wide,
very long (100K rows) the difference was about 500 ms (on pager side, it
does nothing else than string operations in this moment)).

Regards

Pavel

Show quoted text

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

Based on the following list, it seems to me that [u+1f300,u+0x1faff]
won't capture everything, like the country flags:
http://www.unicode.org/emoji/charts/full-emoji-list.html
--
Michael

#10Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Michael Paquier (#8)
Re: badly calculated width of emoji in psql

On Mon, 2021-07-19 at 16:46 +0900, Michael Paquier wrote:

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

Based on the following list, it seems to me that [u+1f300,u+0x1faff]
won't capture everything, like the country flags:
http://www.unicode.org/emoji/charts/full-emoji-list.html

That could be adapted; the question is if the approach as such is
desirable or not. This is necessarily a moving target, at the rate
that emojis are created and added to Unicode.

My personal feeling is that something simple and perhaps imperfect
as my one-liner that may miss some corner cases would be ok, but
anything that saps more performance or is complicated would not
be worth the effort.

Yours,
Laurenz Albe

#11Jacob Champion
pchampion@vmware.com
In reply to: Laurenz Albe (#10)
Re: badly calculated width of emoji in psql

On Mon, 2021-07-19 at 13:13 +0200, Laurenz Albe wrote:

On Mon, 2021-07-19 at 16:46 +0900, Michael Paquier wrote:

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

Based on the following list, it seems to me that [u+1f300,u+0x1faff]
won't capture everything, like the country flags:
https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.unicode.org%2Femoji%2Fcharts%2Ffull-emoji-list.html&amp;amp;data=04%7C01%7Cpchampion%40vmware.com%7Cbc3f4cff42094f60fa7708d94aa64f11%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637622900429154586%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;amp;sdata=lfSsqU%2BEiSJrwftt9FL13ib7pw0Mzt5DYl%2BSjL2%2Bm%2F0%3D&amp;amp;reserved=0

On my machine, the regional indicator codes take up one column each
(even though they display as a wide uppercase letter), so making them
wide would break alignment. This seems to match up with Annex #11 [1]http://www.unicode.org/reports/tr11/:

ED4. East Asian Wide (W): All other characters that are always
wide. [...] This category includes characters that have explicit
halfwidth counterparts, along with characters that have the [UTS51]
property Emoji_Presentation, with the exception of characters that
have the [UCD] property Regional_Indicator

So for whatever reason, those indicator codes aren't considered East
Asian Wide by Unicode (and therefore glibc), even though they are
Emoji_Presentation. And glibc appears to be using East Asian Wide as
the flag for a 2-column character.

glibc 2.31 is based on Unicode 12.1, I think. So if Postgres is built
against a Unicode database that's different from the system's,
obviously you'll see odd results no matter what we do here.

And _all_ of that completely ignores the actual country-flag-combining
behavior, which my terminal doesn't do and I assume would be part of a
separate conversation entirely, along with things like ZWJ sequences.

That could be adapted; the question is if the approach as such is
desirable or not. This is necessarily a moving target, at the rate
that emojis are created and added to Unicode.

Sure. We already have code in the tree that deals with that moving
target, though, by parsing apart pieces of the Unicode database. So the
added maintenance cost should be pretty low.

My personal feeling is that something simple and perhaps imperfect
as my one-liner that may miss some corner cases would be ok, but
anything that saps more performance or is complicated would not
be worth the effort.

Another data point: on my machine (Ubuntu 20.04, glibc 2.31) that
additional range not only misses a large number of emoji (e.g. in the
2xxx codepoint range), it incorrectly treats some narrow codepoints as
wide (e.g. many in the 1F32x range have Emoji_Presentation set to
false).

I note that the doc comment for ucs_wcwidth()...

* - Spacing characters in the East Asian Wide (W) or East Asian
* FullWidth (F) category as defined in Unicode Technical
* Report #11 have a column width of 2.

...doesn't match reality anymore. The East Asian width handling was
last updated in 2006, it looks like? So I wonder whether fixing the
code to match the comment would not only fix the emoji problem but also
a bunch of other non-emoji characters.

--Jacob

[1]: http://www.unicode.org/reports/tr11/

#12Jacob Champion
pchampion@vmware.com
In reply to: Jacob Champion (#11)
Re: badly calculated width of emoji in psql

On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:

On Mon, 2021-07-19 at 13:13 +0200, Laurenz Albe wrote:

That could be adapted; the question is if the approach as such is
desirable or not. This is necessarily a moving target, at the rate
that emojis are created and added to Unicode.

Sure. We already have code in the tree that deals with that moving
target, though, by parsing apart pieces of the Unicode database. So the
added maintenance cost should be pretty low.

(I am working on such a patch today and will report back.)

--Jacob

#13Jacob Champion
pchampion@vmware.com
In reply to: Jacob Champion (#11)
1 attachment(s)
Re: badly calculated width of emoji in psql

On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:

I note that the doc comment for ucs_wcwidth()...

* - Spacing characters in the East Asian Wide (W) or East Asian
* FullWidth (F) category as defined in Unicode Technical
* Report #11 have a column width of 2.

...doesn't match reality anymore. The East Asian width handling was
last updated in 2006, it looks like? So I wonder whether fixing the
code to match the comment would not only fix the emoji problem but also
a bunch of other non-emoji characters.

Attached is my attempt at that. This adds a second interval table,
handling not only the emoji range in the original patch but also
correcting several non-emoji character ranges which are included in the
13.0 East Asian Wide/Fullwidth sets. Try for example

- U+2329 LEFT POINTING ANGLE BRACKET
- U+16FE0 TANGUT ITERATION MARK
- U+18000 KATAKANA LETTER ARCHAIC E

This should work reasonably well for terminals that depend on modern
versions of Unicode's EastAsianWidth.txt to figure out character width.
I don't know how it behaves on BSD libc or Windows.

The new binary search isn't free, but my naive attempt at measuring the
performance hit made it look worse than it actually is. Since the
measurement function was previously returning an incorrect (too short)
width, we used to get a free performance boost by not printing the
correct number of alignment/border characters. I'm still trying to
figure out how best to isolate the performance changes due to this
patch.

Pavel, I'd be interested to see what your benchmarks find with this
code. Does this fix the original issue for you?

--Jacob

Attachments:

ucs_wcwidth-update-Fullwidth-and-Wide-codepoint-set.patchtext/x-patch; name=ucs_wcwidth-update-Fullwidth-and-Wide-codepoint-set.patchDownload
From e59292beb2aeb1860734ead992cea38f59f6cab6 Mon Sep 17 00:00:00 2001
From: Jacob Champion <pchampion@vmware.com>
Date: Wed, 21 Jul 2021 10:41:05 -0700
Subject: [PATCH] ucs_wcwidth: update Fullwidth and Wide codepoint set

The hardcoded "wide character" set at the end of ucs_wcwidth() was last
touched around the Unicode 5.0 era.  This led to misalignment on modern
platforms when printing emoji and other codepoints that have since been
designated wide/fullwidth.

Use an interval table for these codepoints, and add a recipe to the
existing update-unicode rule to keep it up to date.

TODO: performance implications
---
 src/common/unicode/.gitignore                 |   1 +
 src/common/unicode/Makefile                   |   9 +-
 .../generate-unicode_east_asian_fw_table.pl   |  76 +++++++++++
 src/common/wchar.c                            |  18 +--
 .../common/unicode_east_asian_fw_table.h      | 120 ++++++++++++++++++
 5 files changed, 208 insertions(+), 16 deletions(-)
 create mode 100644 src/common/unicode/generate-unicode_east_asian_fw_table.pl
 create mode 100644 src/include/common/unicode_east_asian_fw_table.h

diff --git a/src/common/unicode/.gitignore b/src/common/unicode/.gitignore
index 512862e538..46243f701d 100644
--- a/src/common/unicode/.gitignore
+++ b/src/common/unicode/.gitignore
@@ -4,5 +4,6 @@
 # Downloaded files
 /CompositionExclusions.txt
 /DerivedNormalizationProps.txt
+/EastAsianWidth.txt
 /NormalizationTest.txt
 /UnicodeData.txt
diff --git a/src/common/unicode/Makefile b/src/common/unicode/Makefile
index eb14add28a..a3683dd86b 100644
--- a/src/common/unicode/Makefile
+++ b/src/common/unicode/Makefile
@@ -18,14 +18,14 @@ LIBS += $(PTHREAD_LIBS)
 # By default, do nothing.
 all:
 
-update-unicode: unicode_norm_table.h unicode_combining_table.h unicode_normprops_table.h unicode_norm_hashfunc.h
+update-unicode: unicode_norm_table.h unicode_combining_table.h unicode_east_asian_fw_table.h unicode_normprops_table.h unicode_norm_hashfunc.h
 	mv $^ ../../../src/include/common/
 	$(MAKE) normalization-check
 
 # These files are part of the Unicode Character Database. Download
 # them on demand.  The dependency on Makefile.global is for
 # UNICODE_VERSION.
-UnicodeData.txt DerivedNormalizationProps.txt CompositionExclusions.txt NormalizationTest.txt: $(top_builddir)/src/Makefile.global
+UnicodeData.txt EastAsianWidth.txt DerivedNormalizationProps.txt CompositionExclusions.txt NormalizationTest.txt: $(top_builddir)/src/Makefile.global
 	$(DOWNLOAD) https://www.unicode.org/Public/$(UNICODE_VERSION)/ucd/$(@F)
 
 # Generation of conversion tables used for string normalization with
@@ -38,6 +38,9 @@ unicode_norm_table.h: generate-unicode_norm_table.pl UnicodeData.txt Composition
 unicode_combining_table.h: generate-unicode_combining_table.pl UnicodeData.txt
 	$(PERL) $^ >$@
 
+unicode_east_asian_fw_table.h: generate-unicode_east_asian_fw_table.pl EastAsianWidth.txt
+	$(PERL) $^ >$@
+
 unicode_normprops_table.h: generate-unicode_normprops_table.pl DerivedNormalizationProps.txt
 	$(PERL) $^ >$@
 
@@ -64,6 +67,6 @@ clean:
 	rm -f $(OBJS) norm_test norm_test.o
 
 distclean: clean
-	rm -f UnicodeData.txt CompositionExclusions.txt NormalizationTest.txt norm_test_table.h unicode_norm_table.h
+	rm -f UnicodeData.txt EastAsianWidth.txt CompositionExclusions.txt NormalizationTest.txt norm_test_table.h unicode_norm_table.h
 
 maintainer-clean: distclean
diff --git a/src/common/unicode/generate-unicode_east_asian_fw_table.pl b/src/common/unicode/generate-unicode_east_asian_fw_table.pl
new file mode 100644
index 0000000000..d29fdd5157
--- /dev/null
+++ b/src/common/unicode/generate-unicode_east_asian_fw_table.pl
@@ -0,0 +1,76 @@
+#!/usr/bin/perl
+#
+# Generate sorted list of non-overlapping intervals of East Asian Wide (W) and
+# East Asian Fullwidth (F) characters, using Unicode data files as input.  Pass
+# EastAsianWidth.txt as argument.  The output is on stdout.
+#
+# Copyright (c) 2019-2021, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+my $range_start = undef;
+my ($first, $last);
+my $prev_last;
+
+print
+  "/* generated by src/common/unicode/generate-unicode_east_asian_fw_table.pl, do not edit */\n\n";
+
+print "static const struct mbinterval east_asian_fw[] = {\n";
+
+foreach my $line (<ARGV>)
+{
+	chomp $line;
+	$line =~ s/\s*#.*$//;
+	next if $line eq '';
+	my ($codepoint, $width) = split ';', $line;
+
+	if ($codepoint =~ /\.\./)
+	{
+		($first, $last) = split /\.\./, $codepoint;
+	}
+	else
+	{
+		$first = $last = $codepoint;
+	}
+
+	($first, $last) = map(hex, ($first, $last));
+
+	if ($width eq 'F' || $width eq 'W')
+	{
+		# fullwidth/wide characters
+		if (!defined($range_start))
+		{
+			# save for start of range if one hasn't been started yet
+			$range_start = $first;
+		}
+		elsif ($first != $prev_last + 1)
+		{
+			# ranges aren't contiguous; emit the last and start a new one
+			printf "\t{0x%04X, 0x%04X},\n", $range_start, $prev_last;
+			$range_start = $first;
+		}
+	}
+	else
+	{
+		# not wide characters, print out previous range if any
+		if (defined($range_start))
+		{
+			printf "\t{0x%04X, 0x%04X},\n", $range_start, $prev_last;
+			$range_start = undef;
+		}
+	}
+}
+continue
+{
+	$prev_last = $last;
+}
+
+# don't forget any ranges at the very end of the database (though there are none
+# as of Unicode 13.0)
+if (defined($range_start))
+{
+	printf "\t{0x%04X, 0x%04X},\n", $range_start, $prev_last;
+}
+
+print "};\n";
diff --git a/src/common/wchar.c b/src/common/wchar.c
index 0636b8765b..43f1078ae6 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -583,8 +583,8 @@ pg_utf_mblen(const unsigned char *s)
 
 struct mbinterval
 {
-	unsigned short first;
-	unsigned short last;
+	unsigned int first;
+	unsigned int last;
 };
 
 /* auxiliary function for binary search in interval table */
@@ -645,6 +645,7 @@ static int
 ucs_wcwidth(pg_wchar ucs)
 {
 #include "common/unicode_combining_table.h"
+#include "common/unicode_east_asian_fw_table.h"
 
 	/* test for 8-bit control characters */
 	if (ucs == 0)
@@ -663,17 +664,8 @@ ucs_wcwidth(pg_wchar ucs)
 	 */
 
 	return 1 +
-		(ucs >= 0x1100 &&
-		 (ucs <= 0x115f ||		/* Hangul Jamo init. consonants */
-		  (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
-		   ucs != 0x303f) ||	/* CJK ... Yi */
-		  (ucs >= 0xac00 && ucs <= 0xd7a3) ||	/* Hangul Syllables */
-		  (ucs >= 0xf900 && ucs <= 0xfaff) ||	/* CJK Compatibility
-												 * Ideographs */
-		  (ucs >= 0xfe30 && ucs <= 0xfe6f) ||	/* CJK Compatibility Forms */
-		  (ucs >= 0xff00 && ucs <= 0xff5f) ||	/* Fullwidth Forms */
-		  (ucs >= 0xffe0 && ucs <= 0xffe6) ||
-		  (ucs >= 0x20000 && ucs <= 0x2ffff)));
+		mbbisearch(ucs, east_asian_fw,
+				   sizeof(east_asian_fw) / sizeof(struct mbinterval) - 1);
 }
 
 /*
diff --git a/src/include/common/unicode_east_asian_fw_table.h b/src/include/common/unicode_east_asian_fw_table.h
new file mode 100644
index 0000000000..b27f95b5dc
--- /dev/null
+++ b/src/include/common/unicode_east_asian_fw_table.h
@@ -0,0 +1,120 @@
+/* generated by src/common/unicode/generate-unicode_east_asian_fw_table.pl, do not edit */
+
+static const struct mbinterval east_asian_fw[] = {
+	{0x1100, 0x115F},
+	{0x231A, 0x231B},
+	{0x2329, 0x232A},
+	{0x23E9, 0x23EC},
+	{0x23F0, 0x23F0},
+	{0x23F3, 0x23F3},
+	{0x25FD, 0x25FE},
+	{0x2614, 0x2615},
+	{0x2648, 0x2653},
+	{0x267F, 0x267F},
+	{0x2693, 0x2693},
+	{0x26A1, 0x26A1},
+	{0x26AA, 0x26AB},
+	{0x26BD, 0x26BE},
+	{0x26C4, 0x26C5},
+	{0x26CE, 0x26CE},
+	{0x26D4, 0x26D4},
+	{0x26EA, 0x26EA},
+	{0x26F2, 0x26F3},
+	{0x26F5, 0x26F5},
+	{0x26FA, 0x26FA},
+	{0x26FD, 0x26FD},
+	{0x2705, 0x2705},
+	{0x270A, 0x270B},
+	{0x2728, 0x2728},
+	{0x274C, 0x274C},
+	{0x274E, 0x274E},
+	{0x2753, 0x2755},
+	{0x2757, 0x2757},
+	{0x2795, 0x2797},
+	{0x27B0, 0x27B0},
+	{0x27BF, 0x27BF},
+	{0x2B1B, 0x2B1C},
+	{0x2B50, 0x2B50},
+	{0x2B55, 0x2B55},
+	{0x2E80, 0x2E99},
+	{0x2E9B, 0x2EF3},
+	{0x2F00, 0x2FD5},
+	{0x2FF0, 0x2FFB},
+	{0x3000, 0x303E},
+	{0x3041, 0x3096},
+	{0x3099, 0x30FF},
+	{0x3105, 0x312F},
+	{0x3131, 0x318E},
+	{0x3190, 0x31E3},
+	{0x31F0, 0x321E},
+	{0x3220, 0x3247},
+	{0x3250, 0x4DBF},
+	{0x4E00, 0xA48C},
+	{0xA490, 0xA4C6},
+	{0xA960, 0xA97C},
+	{0xAC00, 0xD7A3},
+	{0xF900, 0xFAFF},
+	{0xFE10, 0xFE19},
+	{0xFE30, 0xFE52},
+	{0xFE54, 0xFE66},
+	{0xFE68, 0xFE6B},
+	{0xFF01, 0xFF60},
+	{0xFFE0, 0xFFE6},
+	{0x16FE0, 0x16FE4},
+	{0x16FF0, 0x16FF1},
+	{0x17000, 0x187F7},
+	{0x18800, 0x18CD5},
+	{0x18D00, 0x18D08},
+	{0x1B000, 0x1B11E},
+	{0x1B150, 0x1B152},
+	{0x1B164, 0x1B167},
+	{0x1B170, 0x1B2FB},
+	{0x1F004, 0x1F004},
+	{0x1F0CF, 0x1F0CF},
+	{0x1F18E, 0x1F18E},
+	{0x1F191, 0x1F19A},
+	{0x1F200, 0x1F202},
+	{0x1F210, 0x1F23B},
+	{0x1F240, 0x1F248},
+	{0x1F250, 0x1F251},
+	{0x1F260, 0x1F265},
+	{0x1F300, 0x1F320},
+	{0x1F32D, 0x1F335},
+	{0x1F337, 0x1F37C},
+	{0x1F37E, 0x1F393},
+	{0x1F3A0, 0x1F3CA},
+	{0x1F3CF, 0x1F3D3},
+	{0x1F3E0, 0x1F3F0},
+	{0x1F3F4, 0x1F3F4},
+	{0x1F3F8, 0x1F43E},
+	{0x1F440, 0x1F440},
+	{0x1F442, 0x1F4FC},
+	{0x1F4FF, 0x1F53D},
+	{0x1F54B, 0x1F54E},
+	{0x1F550, 0x1F567},
+	{0x1F57A, 0x1F57A},
+	{0x1F595, 0x1F596},
+	{0x1F5A4, 0x1F5A4},
+	{0x1F5FB, 0x1F64F},
+	{0x1F680, 0x1F6C5},
+	{0x1F6CC, 0x1F6CC},
+	{0x1F6D0, 0x1F6D2},
+	{0x1F6D5, 0x1F6D7},
+	{0x1F6EB, 0x1F6EC},
+	{0x1F6F4, 0x1F6FC},
+	{0x1F7E0, 0x1F7EB},
+	{0x1F90C, 0x1F93A},
+	{0x1F93C, 0x1F945},
+	{0x1F947, 0x1F978},
+	{0x1F97A, 0x1F9CB},
+	{0x1F9CD, 0x1F9FF},
+	{0x1FA70, 0x1FA74},
+	{0x1FA78, 0x1FA7A},
+	{0x1FA80, 0x1FA86},
+	{0x1FA90, 0x1FAA8},
+	{0x1FAB0, 0x1FAB6},
+	{0x1FAC0, 0x1FAC2},
+	{0x1FAD0, 0x1FAD6},
+	{0x20000, 0x2FFFD},
+	{0x30000, 0x3FFFD},
+};
-- 
2.25.1

#14Pavel Stehule
pavel.stehule@gmail.com
In reply to: Jacob Champion (#13)
Re: badly calculated width of emoji in psql

Hi

čt 22. 7. 2021 v 0:12 odesílatel Jacob Champion <pchampion@vmware.com>
napsal:

On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:

I note that the doc comment for ucs_wcwidth()...

* - Spacing characters in the East Asian Wide (W) or East Asian
* FullWidth (F) category as defined in Unicode Technical
* Report #11 have a column width of 2.

...doesn't match reality anymore. The East Asian width handling was
last updated in 2006, it looks like? So I wonder whether fixing the
code to match the comment would not only fix the emoji problem but also
a bunch of other non-emoji characters.

Attached is my attempt at that. This adds a second interval table,
handling not only the emoji range in the original patch but also
correcting several non-emoji character ranges which are included in the
13.0 East Asian Wide/Fullwidth sets. Try for example

- U+2329 LEFT POINTING ANGLE BRACKET
- U+16FE0 TANGUT ITERATION MARK
- U+18000 KATAKANA LETTER ARCHAIC E

This should work reasonably well for terminals that depend on modern
versions of Unicode's EastAsianWidth.txt to figure out character width.
I don't know how it behaves on BSD libc or Windows.

The new binary search isn't free, but my naive attempt at measuring the
performance hit made it look worse than it actually is. Since the
measurement function was previously returning an incorrect (too short)
width, we used to get a free performance boost by not printing the
correct number of alignment/border characters. I'm still trying to
figure out how best to isolate the performance changes due to this
patch.

Pavel, I'd be interested to see what your benchmarks find with this
code. Does this fix the original issue for you?

I can confirm that the original issue is fixed.

I tested performance

I had three data sets

1. typical data - mix ascii and utf characters typical for czech language -
25K lines - there is very small slowdown 2ms from 24 to 26ms (stored file
of this result has 3MB)

2. the worst case - this reports has only emoji 1000 chars * 10K rows -
there is more significant slowdown - from 160 ms to 220 ms (stored file has
39MB)

3. a little bit of obscure datasets generated by \x and select * from
pg_proc - it has 99K lines - and there are a lot of unicode decorations
(borders). The line has 17K chars. (the stored file has 1.7GB)
In this dataset I see a slowdown from 4300 to 4700 ms.

In all cases, the data are in memory (in filesystem cache). I tested load
to pspg.

9% looks too high, but in absolute time it is 400ms for 99K lines and very
untypical data, or 2ms for more typical results., 2ms are nothing (for
interactive work). More - this is from a pspg perspective. In psql there
can be overhead of network, protocol processing, formatting, and more and
more, and psql doesn't need to calculate display width of decorations
(borders), what is the reason for slowdowns in pspg.

Pavel

Show quoted text

--Jacob

#15Jacob Champion
pchampion@vmware.com
In reply to: Pavel Stehule (#14)
2 attachment(s)
Re: badly calculated width of emoji in psql

On Fri, 2021-07-23 at 17:42 +0200, Pavel Stehule wrote:

čt 22. 7. 2021 v 0:12 odesílatel Jacob Champion <pchampion@vmware.com> napsal:

Pavel, I'd be interested to see what your benchmarks find with this
code. Does this fix the original issue for you?

I can confirm that the original issue is fixed.

Great!

I tested performance

I had three data sets

1. typical data - mix ascii and utf characters typical for czech
language - 25K lines - there is very small slowdown 2ms from 24 to
26ms (stored file of this result has 3MB)

2. the worst case - this reports has only emoji 1000 chars * 10K rows
- there is more significant slowdown - from 160 ms to 220 ms (stored
file has 39MB)

I assume the stored file size has grown with this patch, since we're
now printing the correct number of spacing/border characters?

3. a little bit of obscure datasets generated by \x and select * from
pg_proc - it has 99K lines - and there are a lot of unicode
decorations (borders). The line has 17K chars. (the stored file has
1.7GB)
In this dataset I see a slowdown from 4300 to 4700 ms.

These results are lining up fairly well with my profiling. To isolate
the effects of the new algorithm (as opposed to printing time) I
redirected to /dev/null:

psql postgres -c "select repeat(unistr('\u115D'), 10000000);" > /dev/null

This is what I expect to be the worst case for the new patch: a huge
string consisting of nothing but \u115D, which is in the first interval
and therefore takes the maximum number of iterations during the binary
search.

For that command, the wall time slowdown with this patch was about
240ms (from 1.128s to 1.366s, or 21% slower). Callgrind shows an
increase of 18% in the number of instructions executed with the
interval table patch, all of it coming from PQdsplen (no surprise).
PQdsplen itself has a 36% increase in instruction count for that run.

I also did a microbenchmark of PQdsplen (code attached, requires Google
Benchmark [1]https://github.com/google/benchmark). The three cases I tested were standard ASCII
characters, a smiley-face emoji, and the worst-case \u115F character.

Without the patch:

------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
...
BM_Ascii_mean 4.97 ns 4.97 ns 5
BM_Ascii_median 4.97 ns 4.97 ns 5
BM_Ascii_stddev 0.035 ns 0.035 ns 5
...
BM_Emoji_mean 6.30 ns 6.30 ns 5
BM_Emoji_median 6.30 ns 6.30 ns 5
BM_Emoji_stddev 0.045 ns 0.045 ns 5
...
BM_Worst_mean 12.4 ns 12.4 ns 5
BM_Worst_median 12.4 ns 12.4 ns 5
BM_Worst_stddev 0.038 ns 0.038 ns 5

With the patch:

------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
...
BM_Ascii_mean 4.59 ns 4.59 ns 5
BM_Ascii_median 4.60 ns 4.60 ns 5
BM_Ascii_stddev 0.069 ns 0.068 ns 5
...
BM_Emoji_mean 11.8 ns 11.8 ns 5
BM_Emoji_median 11.8 ns 11.8 ns 5
BM_Emoji_stddev 0.059 ns 0.059 ns 5
...
BM_Worst_mean 18.5 ns 18.5 ns 5
BM_Worst_median 18.5 ns 18.5 ns 5
BM_Worst_stddev 0.077 ns 0.077 ns 5

So an incredibly tiny improvement in the ASCII case, which is
reproducible across multiple runs and not just a fluke (I assume
because the code is smaller now and has better cache line
characteristics?). A ~90% slowdown for the emoji case, and a ~50%
slowdown for the worst-performing characters. That seems perfectly
reasonable considering we're talking about dozens of nanoseconds.

9% looks too high, but in absolute time it is 400ms for 99K lines and
very untypical data, or 2ms for more typical results., 2ms are
nothing (for interactive work). More - this is from a pspg
perspective. In psql there can be overhead of network, protocol
processing, formatting, and more and more, and psql doesn't need to
calculate display width of decorations (borders), what is the reason
for slowdowns in pspg.

Yeah. Considering the alignment code is for user display, the absolute
performance is going to dominate, and I don't see any red flags so far.
If you're regularly dealing with unbelievably huge amounts of emoji, I
think the amount of extra time we're seeing here is unlikely to be a
problem. If it is, you can always turn alignment off. (Do you rely on
horizontal alignment for lines with millions of characters, anyway?)

Laurenz, Michael, what do you think?

--Jacob

[1]: https://github.com/google/benchmark

Attachments:

main.cpptext/x-c++src; name=main.cppDownload
Makefiletext/x-makefile; name=MakefileDownload
#16Pavel Stehule
pavel.stehule@gmail.com
In reply to: Jacob Champion (#13)
Re: badly calculated width of emoji in psql

čt 22. 7. 2021 v 0:12 odesílatel Jacob Champion <pchampion@vmware.com>
napsal:

On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:

I note that the doc comment for ucs_wcwidth()...

* - Spacing characters in the East Asian Wide (W) or East Asian
* FullWidth (F) category as defined in Unicode Technical
* Report #11 have a column width of 2.

...doesn't match reality anymore. The East Asian width handling was
last updated in 2006, it looks like? So I wonder whether fixing the
code to match the comment would not only fix the emoji problem but also
a bunch of other non-emoji characters.

Attached is my attempt at that. This adds a second interval table,
handling not only the emoji range in the original patch but also
correcting several non-emoji character ranges which are included in the
13.0 East Asian Wide/Fullwidth sets. Try for example

- U+2329 LEFT POINTING ANGLE BRACKET
- U+16FE0 TANGUT ITERATION MARK
- U+18000 KATAKANA LETTER ARCHAIC E

This should work reasonably well for terminals that depend on modern
versions of Unicode's EastAsianWidth.txt to figure out character width.
I don't know how it behaves on BSD libc or Windows.

The new binary search isn't free, but my naive attempt at measuring the
performance hit made it look worse than it actually is. Since the
measurement function was previously returning an incorrect (too short)
width, we used to get a free performance boost by not printing the
correct number of alignment/border characters. I'm still trying to
figure out how best to isolate the performance changes due to this
patch.

Pavel, I'd be interested to see what your benchmarks find with this
code. Does this fix the original issue for you?

This patch fixed badly formatted tables with emoji.

I checked this patch, and it is correct and a step forward, because it
dynamically sets intervals of double wide characters, and the code is more
readable.

I checked and performance, and although there is measurable slowdown, it is
negligible in absolute values. Previous code was a little bit faster - it
checked less ranges, but was not fully correct and up to date.

The patching was without problems
There are no regress tests, but I am not sure so they are necessary for
this case.
make check-world passed without problems

I'll mark this patch as ready for committer

Regards

Pavel

Show quoted text

--Jacob

#17John Naylor
john.naylor@enterprisedb.com
In reply to: Pavel Stehule (#16)
Re: badly calculated width of emoji in psql

I tried this patch on and MacOS11/iterm2 and RHEL 7 (ssh'd from the Mac, in
case that matters) and the example shown at the top of the thread shows no
difference:

john.naylor=# \pset border 2
Border style is 2.
john.naylor=# SELECT U&'\+01F603';
+----------+
| ?column? |
+----------+
| 😃 |
+----------+
(1 row)

(In case it doesn't render locally, the right bar in the result cell is
still shifted to the right.

What is the expected context to show a behavior change? Does one need some
specific terminal or setting?

--
John Naylor
EDB: http://www.enterprisedb.com

#18Pavel Stehule
pavel.stehule@gmail.com
In reply to: John Naylor (#17)
2 attachment(s)
Re: badly calculated width of emoji in psql

čt 12. 8. 2021 v 18:36 odesílatel John Naylor <john.naylor@enterprisedb.com>
napsal:

I tried this patch on and MacOS11/iterm2 and RHEL 7 (ssh'd from the Mac,
in case that matters) and the example shown at the top of the thread shows
no difference:

john.naylor=# \pset border 2
Border style is 2.
john.naylor=# SELECT U&'\+01F603';
+----------+
| ?column? |
+----------+
| 😃 |
+----------+
(1 row)

did you run make clean?

when I executed just patch & make, it didn't work

(In case it doesn't render locally, the right bar in the result cell is
still shifted to the right.

What is the expected context to show a behavior change? Does one need some
specific terminal or setting?

I assigned screenshots

Show quoted text

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

broken.pngimage/png; name=broken.pngDownload
ok.pngimage/png; name=ok.pngDownload
#19Jacob Champion
pchampion@vmware.com
In reply to: John Naylor (#17)
Re: badly calculated width of emoji in psql

On Thu, 2021-08-12 at 12:36 -0400, John Naylor wrote:

I tried this patch on and MacOS11/iterm2 and RHEL 7 (ssh'd from the Mac, in case that matters) and the example shown at the top of the thread shows no difference:

john.naylor=# \pset border 2
Border style is 2.
john.naylor=# SELECT U&'\+01F603';
+----------+
| ?column? |
+----------+
| 😃 |
+----------+
(1 row)

(In case it doesn't render locally, the right bar in the result cell is still shifted to the right.

What is the expected context to show a behavior change?

There shouldn't be anything special. (If your terminal was set up to
display emoji in single columns, that would cause alignment issues, but
in the opposite direction to the one you're seeing.)

Does one need some specific terminal or setting?

In your case, an incorrect number of spaces are being printed, so it
shouldn't have anything to do with your terminal settings.

Was this a clean build? Perhaps I've introduced (or exacerbated) a
dependency bug in the Makefile? The patch doing nothing is a surprising
result given the code change.

--Jacob

#20John Naylor
john.naylor@enterprisedb.com
In reply to: Jacob Champion (#19)
Re: badly calculated width of emoji in psql

On Thu, Aug 12, 2021 at 12:46 PM Pavel Stehule <pavel.stehule@gmail.com>
wrote:

did you run make clean?

when I executed just patch & make, it didn't work

I did not, but I always have --enable-depend on. I tried again with make
clean, and ccache -C just in case, and it works now.

On Thu, Aug 12, 2021 at 12:54 PM Jacob Champion <pchampion@vmware.com>
wrote:

Was this a clean build? Perhaps I've introduced (or exacerbated) a
dependency bug in the Makefile? The patch doing nothing is a surprising
result given the code change.

Yeah, given that Pavel had the same issue, that's a possibility. I don't
recall that happening with other unicode patches I've tested.

--
John Naylor
EDB: http://www.enterprisedb.com

#21John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#20)
1 attachment(s)
Re: badly calculated width of emoji in psql

The patch looks pretty good to me. I just have a stylistic suggestion which
I've attached as a text file. There are also some outdated comments that
are not the responsibility of this patch, but I kind of want to fix them
now:

* - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
* have a column width of 0.

We got rid of this range in d8594d123c1, which is correct.

* - Other format characters (general category code Cf in the Unicode
* database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.

We don't treat Cf the same as Me or Mn, and I believe that's deliberate. We
also no longer have the exception for zero-width space.

It seems the consensus so far is that performance is not an issue, and I'm
inclined to agree.

I'm a bit concerned about the build dependencies not working right, but
it's not clear it's even due to the patch. I'll spend some time
investigating next week.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

ucs-jcn-addendum.txttext/plain; charset=US-ASCII; name=ucs-jcn-addendum.txtDownload
diff --git a/src/common/wchar.c b/src/common/wchar.c
index 43f1078ae6..467cb8921a 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -623,12 +623,6 @@ mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
  *		category code Mn or Me in the Unicode database) have a
  *		column width of 0.
  *
- *	  - Other format characters (general category code Cf in the Unicode
- *		database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
- *
- *	  - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
- *		have a column width of 0.
- *
  *	  - Spacing characters in the East Asian Wide (W) or East Asian
  *		FullWidth (F) category as defined in Unicode Technical
  *		Report #11 have a column width of 2.
@@ -659,13 +653,12 @@ ucs_wcwidth(pg_wchar ucs)
 				   sizeof(combining) / sizeof(struct mbinterval) - 1))
 		return 0;
 
-	/*
-	 * if we arrive here, ucs is not a combining or C0/C1 control character
-	 */
+	/* binary search in table of wide characters */
+	if (mbbisearch(ucs, east_asian_fw,
+				   sizeof(east_asian_fw) / sizeof(struct mbinterval) - 1))
+		return 2;
 
-	return 1 +
-		mbbisearch(ucs, east_asian_fw,
-				   sizeof(east_asian_fw) / sizeof(struct mbinterval) - 1);
+	return 1;
 }
 
 /*
#22Jacob Champion
pchampion@vmware.com
In reply to: John Naylor (#21)
Re: badly calculated width of emoji in psql

On Thu, 2021-08-12 at 17:13 -0400, John Naylor wrote:

The patch looks pretty good to me. I just have a stylistic suggestion
which I've attached as a text file.

Getting rid of the "clever addition" looks much better to me, thanks. I
haven't verified the changes to the doc comment, but your description
seems reasonable.

I'm a bit concerned about the build dependencies not working right,
but it's not clear it's even due to the patch. I'll spend some time
investigating next week.

If I vandalize src/common/wchar.c on HEAD, say by deleting the contents
of pg_wchar_table, and then run `make install`, then libpq doesn't get
rebuilt and there's no effect on the frontend. The postgres executable
does get rebuilt for the backend.

It looks like src/interfaces/libpq/Makefile doesn't have a dependency
on libpgcommon (or libpgport, for that matter). For comparison,
src/backend/Makefile has this:

OBJS = \
$(LOCALOBJS) \
$(SUBDIROBJS) \
$(top_builddir)/src/common/libpgcommon_srv.a \
$(top_builddir)/src/port/libpgport_srv.a

So I think that's a bug that needs to be fixed independently, whether
this patch goes in or not.

--Jacob

#23Michael Paquier
michael@paquier.xyz
In reply to: John Naylor (#21)
Re: badly calculated width of emoji in psql

On Thu, Aug 12, 2021 at 05:13:31PM -0400, John Naylor wrote:

I'm a bit concerned about the build dependencies not working right, but
it's not clear it's even due to the patch. I'll spend some time
investigating next week.

How large do libpgcommon deliverables get with this patch? Skimming
through the patch, that just looks like a couple of bytes, still.
--
Michael

#24John Naylor
john.naylor@enterprisedb.com
In reply to: Michael Paquier (#23)
Re: badly calculated width of emoji in psql

On Sun, Aug 15, 2021 at 10:45 PM Michael Paquier <michael@paquier.xyz>
wrote:

On Thu, Aug 12, 2021 at 05:13:31PM -0400, John Naylor wrote:

I'm a bit concerned about the build dependencies not working right, but
it's not clear it's even due to the patch. I'll spend some time
investigating next week.

How large do libpgcommon deliverables get with this patch? Skimming
through the patch, that just looks like a couple of bytes, still.

More like a couple thousand bytes. That's because the width of mbinterval
doubled. If this is not desirable, we could teach the scripts to adjust the
width of the interval type depending on the largest character they saw.

--
John Naylor
EDB: http://www.enterprisedb.com

#25Jacob Champion
pchampion@vmware.com
In reply to: John Naylor (#24)
Re: badly calculated width of emoji in psql

On Mon, 2021-08-16 at 11:24 -0400, John Naylor wrote:

On Sun, Aug 15, 2021 at 10:45 PM Michael Paquier <michael@paquier.xyz> wrote:

How large do libpgcommon deliverables get with this patch? Skimming
through the patch, that just looks like a couple of bytes, still.

More like a couple thousand bytes. That's because the width
of mbinterval doubled. If this is not desirable, we could teach the
scripts to adjust the width of the interval type depending on the
largest character they saw.

True. Note that the combining character table currently excludes
codepoints outside of the BMP, so if someone wants combinations in
higher planes to be handled correctly in the future, the mbinterval for
that table may have to be widened anyway.

--Jacob

#26John Naylor
john.naylor@enterprisedb.com
In reply to: Jacob Champion (#25)
Re: badly calculated width of emoji in psql

On Mon, Aug 16, 2021 at 1:04 PM Jacob Champion <pchampion@vmware.com> wrote:

On Mon, 2021-08-16 at 11:24 -0400, John Naylor wrote:

On Sun, Aug 15, 2021 at 10:45 PM Michael Paquier <michael@paquier.xyz>

wrote:

How large do libpgcommon deliverables get with this patch? Skimming
through the patch, that just looks like a couple of bytes, still.

More like a couple thousand bytes. That's because the width
of mbinterval doubled. If this is not desirable, we could teach the
scripts to adjust the width of the interval type depending on the
largest character they saw.

True. Note that the combining character table currently excludes
codepoints outside of the BMP, so if someone wants combinations in
higher planes to be handled correctly in the future, the mbinterval for
that table may have to be widened anyway.

Hmm, somehow it escaped my attention that the combining character table
script explicitly excludes those. There's no comment about it. Maybe best
to ask Peter E. (CC'd)

Peter, does the combining char table exclude values > 0xFFFF for size
reasons, correctness, or some other consideration?

--
John Naylor
EDB: http://www.enterprisedb.com

#27John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#24)
Re: badly calculated width of emoji in psql

I wrote:

On Sun, Aug 15, 2021 at 10:45 PM Michael Paquier <michael@paquier.xyz>

wrote:

On Thu, Aug 12, 2021 at 05:13:31PM -0400, John Naylor wrote:

I'm a bit concerned about the build dependencies not working right,

but

it's not clear it's even due to the patch. I'll spend some time
investigating next week.

How large do libpgcommon deliverables get with this patch? Skimming
through the patch, that just looks like a couple of bytes, still.

More like a couple thousand bytes. That's because the width of mbinterval

doubled. If this is not desirable, we could teach the scripts to adjust the
width of the interval type depending on the largest character they saw.

For src/common/libpgcommon.a, in a non-assert / non-debug build:
master: 254912
patch: 256432

And if I go further and remove the limit on the largest character in the
combining table, I get 257248, which is still a relatively small difference.

I had a couple further thoughts:

1. The coding historically used normal comparison and branching for
everything, but recently it only does that to detect control characters,
and then goes through a binary search (and with this patch, two binary
searches) for everything else. Although the performance regression of the
current patch seems negligible, we could use almost the same branches to
fast-path printable ascii text, like this:

+       /* fast path for printable ASCII characters */
+       if (ucs >= 0x20 || ucs < 0x7f)
+               return 1;
+
        /* test for 8-bit control characters */
        if (ucs == 0)
                return 0;
-       if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff)
+       if (ucs < 0xa0 || ucs > 0x0010ffff)
                return -1;

2. As written, the patch adds a script that's very close to an existing
one, and emits a new file that has the same type of contents as an existing
one, both of which are #included in one place. I wonder if we should
consider having just one script that ingests both files and emits one file.
All we need is for mbinterval to encode the character width, but we can
probably do that with a bitfield like the normprops table to save space.
Then, we only do one binary search. Opinions?

--
John Naylor
EDB: http://www.enterprisedb.com

#28Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: John Naylor (#26)
Re: badly calculated width of emoji in psql

On 16.08.21 22:06, John Naylor wrote:

Peter, does the combining char table exclude values > 0xFFFF for size
reasons, correctness, or some other consideration?

I don't remember a reason, other than perhaps making the generated table
match the previous manual table in scope. IIRC, the previous table was
ancient, so perhaps from the days before higher Unicode values were
universally supported in the code.

#29Jacob Champion
pchampion@vmware.com
In reply to: John Naylor (#27)
Re: badly calculated width of emoji in psql

On Thu, 2021-08-19 at 13:49 -0400, John Naylor wrote:

I had a couple further thoughts:

1. The coding historically used normal comparison and branching for
everything, but recently it only does that to detect control
characters, and then goes through a binary search (and with this
patch, two binary searches) for everything else. Although the
performance regression of the current patch seems negligible,

If I'm reading the code correctly, ASCII characters don't go through
the binary searches; they're already short-circuited at the beginning
of mbbisearch(). On my machine that's enough for the patch to be a
performance _improvement_ for ASCII, not a regression.

Does adding another short-circuit at the top improve the
microbenchmarks noticeably? I assumed the compiler had pretty well
optimized all that already.

we could use almost the same branches to fast-path printable ascii
text, like this:

+       /* fast path for printable ASCII characters */
+       if (ucs >= 0x20 || ucs < 0x7f)
+               return 1;

Should be && instead of ||, I think.

+
/* test for 8-bit control characters */
if (ucs == 0)
return 0;

-       if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff)
+       if (ucs < 0xa0 || ucs > 0x0010ffff)
return -1;

2. As written, the patch adds a script that's very close to an
existing one, and emits a new file that has the same type of contents
as an existing one, both of which are #included in one place. I
wonder if we should consider having just one script that ingests both
files and emits one file. All we need is for mbinterval to encode the
character width, but we can probably do that with a bitfield like the
normprops table to save space. Then, we only do one binary search.
Opinions?

I guess it just depends on what the end result looks/performs like.
We'd save seven hops or so in the worst case?

--Jacob

#30John Naylor
john.naylor@enterprisedb.com
In reply to: Jacob Champion (#29)
4 attachment(s)
Re: badly calculated width of emoji in psql

On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion <pchampion@vmware.com> wrote:

On Thu, 2021-08-19 at 13:49 -0400, John Naylor wrote:

I had a couple further thoughts:

1. The coding historically used normal comparison and branching for
everything, but recently it only does that to detect control
characters, and then goes through a binary search (and with this
patch, two binary searches) for everything else. Although the
performance regression of the current patch seems negligible,

If I'm reading the code correctly, ASCII characters don't go through
the binary searches; they're already short-circuited at the beginning
of mbbisearch(). On my machine that's enough for the patch to be a
performance _improvement_ for ASCII, not a regression.

I had assumed that there would be a function call, but looking at the
assembly, it's inlined, so you're right.

Should be && instead of ||, I think.

Yes, you're quite right. Clearly I didn't test it. :-) But given the
previous, I won't pursue this further.

2. As written, the patch adds a script that's very close to an
existing one, and emits a new file that has the same type of contents
as an existing one, both of which are #included in one place. I
wonder if we should consider having just one script that ingests both
files and emits one file. All we need is for mbinterval to encode the
character width, but we can probably do that with a bitfield like the
normprops table to save space. Then, we only do one binary search.
Opinions?

I guess it just depends on what the end result looks/performs like.
We'd save seven hops or so in the worst case?

Something like that. Attached is what I had in mind (using real patches to
see what the CF bot thinks):

0001 is a simple renaming
0002 puts the char width inside the mbinterval so we can put arbitrary
values there
0003 is Jacob's patch adjusted to use the same binary search as for
combining characters
0004 removes the ancient limit on combining characters, so the following
works now:

SELECT U&'\+0102E1\+0102E0';
+----------+
| ?column? |
+----------+
| 𐋡𐋠 |
+----------+
(1 row)

I think the adjustments to 0003 result in a cleaner and more extensible
design, but a case could be made otherwise. The former combining table
script is a bit more complex than the sum of its former self and Jacob's
proposed new script, but just slightly.

Also, I checked the behavior of this comment that I proposed to remove
upthread:

- * - Other format characters (general category code Cf in the Unicode
- * database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.

We don't handle the latter in our current setup:

SELECT U&'foo\200Bbar';
+----------+
| ?column? |
+----------+
| foobar |
+----------+
(1 row)

Not sure if we should do anything about this. It was an explicit exception
years ago in our vendored manual table, but is not labeled as such in the
official Unicode files.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v2-0004-Extend-collection-of-Unicode-combining-characters.patchapplication/octet-stream; name=v2-0004-Extend-collection-of-Unicode-combining-characters.patchDownload
From bfa81d710bbbac6b82228e2d9f30a1856598a443 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Aug 2021 12:35:32 -0400
Subject: [PATCH v2 4/4] Extend collection of Unicode combining characters to
 beyond the BMP

The former limit was perhaps a carryover from an older hand-coded
table.
---
 .../unicode/generate-unicode_width_table.pl   |   2 -
 src/include/common/unicode_width_table.h      | 102 ++++++++++++++++++
 2 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/src/common/unicode/generate-unicode_width_table.pl b/src/common/unicode/generate-unicode_width_table.pl
index 1f8ddc6331..5a361d82d4 100644
--- a/src/common/unicode/generate-unicode_width_table.pl
+++ b/src/common/unicode/generate-unicode_width_table.pl
@@ -30,8 +30,6 @@ foreach my $line (<$UD>)
 	my @fields = split ';', $line;
 	$codepoint = hex $fields[0];
 
-	next if $codepoint > 0xFFFF;
-
 	if ($fields[2] eq 'Me' || $fields[2] eq 'Mn')
 	{
 		# combining character, save for start of range
diff --git a/src/include/common/unicode_width_table.h b/src/include/common/unicode_width_table.h
index 5b7b6e15e2..b8908d90b0 100644
--- a/src/include/common/unicode_width_table.h
+++ b/src/include/common/unicode_width_table.h
@@ -252,7 +252,93 @@ static const struct mbinterval wcwidth[] = {
 	{0xFE68, 0xFE6B, 2},
 	{0xFF01, 0xFF60, 2},
 	{0xFFE0, 0xFFE6, 2},
+	{0x101FD, 0x101FD, 0},
+	{0x102E0, 0x102E0, 0},
+	{0x10376, 0x1037A, 0},
+	{0x10A01, 0x10A0F, 0},
+	{0x10A38, 0x10A3F, 0},
+	{0x10AE5, 0x10AE6, 0},
+	{0x10D24, 0x10D27, 0},
+	{0x10EAB, 0x10EAC, 0},
+	{0x10F46, 0x10F50, 0},
+	{0x11001, 0x11001, 0},
+	{0x11038, 0x11046, 0},
+	{0x1107F, 0x11081, 0},
+	{0x110B3, 0x110B6, 0},
+	{0x110B9, 0x110BA, 0},
+	{0x11100, 0x11102, 0},
+	{0x11127, 0x1112B, 0},
+	{0x1112D, 0x11134, 0},
+	{0x11173, 0x11173, 0},
+	{0x11180, 0x11181, 0},
+	{0x111B6, 0x111BE, 0},
+	{0x111C9, 0x111CC, 0},
+	{0x111CF, 0x111CF, 0},
+	{0x1122F, 0x11231, 0},
+	{0x11234, 0x11234, 0},
+	{0x11236, 0x11237, 0},
+	{0x1123E, 0x1123E, 0},
+	{0x112DF, 0x112DF, 0},
+	{0x112E3, 0x112EA, 0},
+	{0x11300, 0x11301, 0},
+	{0x1133B, 0x1133C, 0},
+	{0x11340, 0x11340, 0},
+	{0x11366, 0x11374, 0},
+	{0x11438, 0x1143F, 0},
+	{0x11442, 0x11444, 0},
+	{0x11446, 0x11446, 0},
+	{0x1145E, 0x1145E, 0},
+	{0x114B3, 0x114B8, 0},
+	{0x114BA, 0x114BA, 0},
+	{0x114BF, 0x114C0, 0},
+	{0x114C2, 0x114C3, 0},
+	{0x115B2, 0x115B5, 0},
+	{0x115BC, 0x115BD, 0},
+	{0x115BF, 0x115C0, 0},
+	{0x115DC, 0x115DD, 0},
+	{0x11633, 0x1163A, 0},
+	{0x1163D, 0x1163D, 0},
+	{0x1163F, 0x11640, 0},
+	{0x116AB, 0x116AB, 0},
+	{0x116AD, 0x116AD, 0},
+	{0x116B0, 0x116B5, 0},
+	{0x116B7, 0x116B7, 0},
+	{0x1171D, 0x1171F, 0},
+	{0x11722, 0x11725, 0},
+	{0x11727, 0x1172B, 0},
+	{0x1182F, 0x11837, 0},
+	{0x11839, 0x1183A, 0},
+	{0x1193B, 0x1193C, 0},
+	{0x1193E, 0x1193E, 0},
+	{0x11943, 0x11943, 0},
+	{0x119D4, 0x119DB, 0},
+	{0x119E0, 0x119E0, 0},
+	{0x11A01, 0x11A0A, 0},
+	{0x11A33, 0x11A38, 0},
+	{0x11A3B, 0x11A3E, 0},
+	{0x11A47, 0x11A47, 0},
+	{0x11A51, 0x11A56, 0},
+	{0x11A59, 0x11A5B, 0},
+	{0x11A8A, 0x11A96, 0},
+	{0x11A98, 0x11A99, 0},
+	{0x11C30, 0x11C3D, 0},
+	{0x11C3F, 0x11C3F, 0},
+	{0x11C92, 0x11CA7, 0},
+	{0x11CAA, 0x11CB0, 0},
+	{0x11CB2, 0x11CB3, 0},
+	{0x11CB5, 0x11CB6, 0},
+	{0x11D31, 0x11D45, 0},
+	{0x11D47, 0x11D47, 0},
+	{0x11D90, 0x11D91, 0},
+	{0x11D95, 0x11D95, 0},
+	{0x11D97, 0x11D97, 0},
+	{0x11EF3, 0x11EF4, 0},
+	{0x16AF0, 0x16AF4, 0},
+	{0x16B30, 0x16B36, 0},
+	{0x16F4F, 0x16F4F, 0},
+	{0x16F8F, 0x16F92, 0},
 	{0x16FE0, 0x16FE4, 2},
+	{0x16FE4, 0x16FE4, 0},
 	{0x16FF0, 0x16FF1, 2},
 	{0x17000, 0x187F7, 2},
 	{0x18800, 0x18CD5, 2},
@@ -261,6 +347,21 @@ static const struct mbinterval wcwidth[] = {
 	{0x1B150, 0x1B152, 2},
 	{0x1B164, 0x1B167, 2},
 	{0x1B170, 0x1B2FB, 2},
+	{0x1BC9D, 0x1BC9E, 0},
+	{0x1D167, 0x1D169, 0},
+	{0x1D17B, 0x1D182, 0},
+	{0x1D185, 0x1D18B, 0},
+	{0x1D1AA, 0x1D1AD, 0},
+	{0x1D242, 0x1D244, 0},
+	{0x1DA00, 0x1DA36, 0},
+	{0x1DA3B, 0x1DA6C, 0},
+	{0x1DA75, 0x1DA75, 0},
+	{0x1DA84, 0x1DA84, 0},
+	{0x1DA9B, 0x1E02A, 0},
+	{0x1E130, 0x1E136, 0},
+	{0x1E2EC, 0x1E2EF, 0},
+	{0x1E8D0, 0x1E8D6, 0},
+	{0x1E944, 0x1E94A, 0},
 	{0x1F004, 0x1F004, 2},
 	{0x1F0CF, 0x1F0CF, 2},
 	{0x1F18E, 0x1F18E, 2},
@@ -309,4 +410,5 @@ static const struct mbinterval wcwidth[] = {
 	{0x1FAD0, 0x1FAD6, 2},
 	{0x20000, 0x2FFFD, 2},
 	{0x30000, 0x3FFFD, 2},
+	{0xE0100, 0xE01EF, 0},
 };
-- 
2.31.1

v2-0001-Change-references-to-unicode_combining_table-to-u.patchapplication/octet-stream; name=v2-0001-Change-references-to-unicode_combining_table-to-u.patchDownload
From c2d0b18013a5d7fe90a9fd7f467d50c8d3ebec54 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Aug 2021 10:34:26 -0400
Subject: [PATCH v2 1/4] Change references to unicode_combining_table to
 unicode_width_table

No functional changes
---
 src/common/unicode/Makefile                                   | 4 ++--
 ...ode_combining_table.pl => generate-unicode_width_table.pl} | 0
 src/common/wchar.c                                            | 2 +-
 .../{unicode_combining_table.h => unicode_width_table.h}      | 0
 4 files changed, 3 insertions(+), 3 deletions(-)
 rename src/common/unicode/{generate-unicode_combining_table.pl => generate-unicode_width_table.pl} (100%)
 rename src/include/common/{unicode_combining_table.h => unicode_width_table.h} (100%)

diff --git a/src/common/unicode/Makefile b/src/common/unicode/Makefile
index eb14add28a..499e31d59f 100644
--- a/src/common/unicode/Makefile
+++ b/src/common/unicode/Makefile
@@ -18,7 +18,7 @@ LIBS += $(PTHREAD_LIBS)
 # By default, do nothing.
 all:
 
-update-unicode: unicode_norm_table.h unicode_combining_table.h unicode_normprops_table.h unicode_norm_hashfunc.h
+update-unicode: unicode_norm_table.h unicode_width_table.h unicode_normprops_table.h unicode_norm_hashfunc.h
 	mv $^ ../../../src/include/common/
 	$(MAKE) normalization-check
 
@@ -35,7 +35,7 @@ unicode_norm_hashfunc.h: unicode_norm_table.h
 unicode_norm_table.h: generate-unicode_norm_table.pl UnicodeData.txt CompositionExclusions.txt
 	$(PERL) generate-unicode_norm_table.pl
 
-unicode_combining_table.h: generate-unicode_combining_table.pl UnicodeData.txt
+unicode_width_table.h: generate-unicode_width_table.pl UnicodeData.txt
 	$(PERL) $^ >$@
 
 unicode_normprops_table.h: generate-unicode_normprops_table.pl DerivedNormalizationProps.txt
diff --git a/src/common/unicode/generate-unicode_combining_table.pl b/src/common/unicode/generate-unicode_width_table.pl
similarity index 100%
rename from src/common/unicode/generate-unicode_combining_table.pl
rename to src/common/unicode/generate-unicode_width_table.pl
diff --git a/src/common/wchar.c b/src/common/wchar.c
index 0636b8765b..bb97b5f54f 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -644,7 +644,7 @@ mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
 static int
 ucs_wcwidth(pg_wchar ucs)
 {
-#include "common/unicode_combining_table.h"
+#include "common/unicode_width_table.h"
 
 	/* test for 8-bit control characters */
 	if (ucs == 0)
diff --git a/src/include/common/unicode_combining_table.h b/src/include/common/unicode_width_table.h
similarity index 100%
rename from src/include/common/unicode_combining_table.h
rename to src/include/common/unicode_width_table.h
-- 
2.31.1

v2-0003-Fix-display-width-of-emoji-and-other-codepoints.patchapplication/octet-stream; name=v2-0003-Fix-display-width-of-emoji-and-other-codepoints.patchDownload
From 0fc7e138b072c358450e53eb10dc6fe7b5f9cb23 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Aug 2021 12:23:50 -0400
Subject: [PATCH v2 3/4] Fix display width of emoji and other codepoints

The hardcoded "wide character" set at the end of ucs_wcwidth() was last
touched around the Unicode 5.0 era.  This led to misalignment on modern
platforms when printing emoji and other codepoints that have since been
designated wide/fullwidth. To fix, extend update-unicode to get the
correct widths from the EastAsianWidth.txt file.

Jacob Champion, with some adjustments by me
Reported and reviewed by Pavel Stehule
Discussion: https://www.postgresql.org/message-id/flat/CAFj8pRCeX21O69YHxmykYySYyprZAqrKWWg0KoGKdjgqcGyygg@mail.gmail.com
---
 src/common/unicode/.gitignore                 |   1 +
 src/common/unicode/Makefile                   |   8 +-
 .../unicode/generate-unicode_width_table.pl   |  99 +++++++++++++--
 src/common/wchar.c                            |  23 +---
 src/include/common/unicode_width_table.h      | 116 ++++++++++++++++++
 5 files changed, 214 insertions(+), 33 deletions(-)

diff --git a/src/common/unicode/.gitignore b/src/common/unicode/.gitignore
index 512862e538..46243f701d 100644
--- a/src/common/unicode/.gitignore
+++ b/src/common/unicode/.gitignore
@@ -4,5 +4,6 @@
 # Downloaded files
 /CompositionExclusions.txt
 /DerivedNormalizationProps.txt
+/EastAsianWidth.txt
 /NormalizationTest.txt
 /UnicodeData.txt
diff --git a/src/common/unicode/Makefile b/src/common/unicode/Makefile
index 499e31d59f..5ddf7d0cb7 100644
--- a/src/common/unicode/Makefile
+++ b/src/common/unicode/Makefile
@@ -25,7 +25,7 @@ update-unicode: unicode_norm_table.h unicode_width_table.h unicode_normprops_tab
 # These files are part of the Unicode Character Database. Download
 # them on demand.  The dependency on Makefile.global is for
 # UNICODE_VERSION.
-UnicodeData.txt DerivedNormalizationProps.txt CompositionExclusions.txt NormalizationTest.txt: $(top_builddir)/src/Makefile.global
+UnicodeData.txt EastAsianWidth.txt DerivedNormalizationProps.txt CompositionExclusions.txt NormalizationTest.txt: $(top_builddir)/src/Makefile.global
 	$(DOWNLOAD) https://www.unicode.org/Public/$(UNICODE_VERSION)/ucd/$(@F)
 
 # Generation of conversion tables used for string normalization with
@@ -35,8 +35,8 @@ unicode_norm_hashfunc.h: unicode_norm_table.h
 unicode_norm_table.h: generate-unicode_norm_table.pl UnicodeData.txt CompositionExclusions.txt
 	$(PERL) generate-unicode_norm_table.pl
 
-unicode_width_table.h: generate-unicode_width_table.pl UnicodeData.txt
-	$(PERL) $^ >$@
+unicode_width_table.h: generate-unicode_width_table.pl UnicodeData.txt EastAsianWidth.txt
+	$(PERL) generate-unicode_width_table.pl >$@
 
 unicode_normprops_table.h: generate-unicode_normprops_table.pl DerivedNormalizationProps.txt
 	$(PERL) $^ >$@
@@ -64,6 +64,6 @@ clean:
 	rm -f $(OBJS) norm_test norm_test.o
 
 distclean: clean
-	rm -f UnicodeData.txt CompositionExclusions.txt NormalizationTest.txt norm_test_table.h unicode_norm_table.h
+	rm -f UnicodeData.txt EastAsianWidth.txt CompositionExclusions.txt NormalizationTest.txt norm_test_table.h unicode_norm_table.h
 
 maintainer-clean: distclean
diff --git a/src/common/unicode/generate-unicode_width_table.pl b/src/common/unicode/generate-unicode_width_table.pl
index 0cf44b029c..1f8ddc6331 100644
--- a/src/common/unicode/generate-unicode_width_table.pl
+++ b/src/common/unicode/generate-unicode_width_table.pl
@@ -1,25 +1,30 @@
 #!/usr/bin/perl
 #
-# Generate sorted list of non-overlapping intervals of non-spacing
-# characters, using Unicode data files as input.  Pass UnicodeData.txt
-# as argument.  The output is on stdout.
+# Generate sorted list of non-overlapping intervals of characters and
+# and their display widths, using Unicode data files as input.
+# The output is on stdout.
 #
 # Copyright (c) 2019-2021, PostgreSQL Global Development Group
 
 use strict;
 use warnings;
 
-my $range_start = undef;
-my $codepoint;
-my $prev_codepoint;
-my $count = 0;
+my $UD;
+my $EAW;
+my @ranges;
 
 print
   "/* generated by src/common/unicode/generate-unicode_width_table.pl, do not edit */\n\n";
 
-print "static const struct mbinterval wcwidth[] = {\n";
+# First get the combining characters (width = 0)
+my $range_start = undef;
+my $codepoint;
+my $prev_codepoint;
+
+open($UD, '<', "UnicodeData.txt")
+  or die "Could not open UnicodeData.txt: $!.";
 
-foreach my $line (<ARGV>)
+foreach my $line (<$UD>)
 {
 	chomp $line;
 	my @fields = split ';', $line;
@@ -40,7 +45,7 @@ foreach my $line (<ARGV>)
 		# not a combining character, print out previous range if any
 		if (defined($range_start))
 		{
-			printf "\t{0x%04X, 0x%04X, 0},\n", $range_start, $prev_codepoint;
+			push @ranges, {first => $range_start, last => $prev_codepoint, width => 0};
 			$range_start = undef;
 		}
 	}
@@ -50,4 +55,78 @@ continue
 	$prev_codepoint = $codepoint;
 }
 
+# Now get the East Asian Wide (W) and East Asian Fullwidth (F) characters (width = 2)
+$range_start = undef;
+my ($first, $last);
+my $prev_last;
+
+open($EAW, '<', "EastAsianWidth.txt")
+  or die "Could not open EastAsianWidth.txt: $!.";
+
+foreach my $line (<$EAW>)
+{
+	chomp $line;
+	$line =~ s/\s*#.*$//;
+	next if $line eq '';
+	my ($codepoint, $width) = split ';', $line;
+
+	if ($codepoint =~ /\.\./)
+	{
+		($first, $last) = split /\.\./, $codepoint;
+	}
+	else
+	{
+		$first = $last = $codepoint;
+	}
+
+	($first, $last) = map(hex, ($first, $last));
+
+	if ($width eq 'F' || $width eq 'W')
+	{
+		# fullwidth/wide characters
+		if (!defined($range_start))
+		{
+			# save for start of range if one hasn't been started yet
+			$range_start = $first;
+		}
+		elsif ($first != $prev_last + 1)
+		{
+			# ranges aren't contiguous; emit the last and start a new one
+			push @ranges, {first => $range_start, last => $prev_last, width => 2};
+			$range_start = $first;
+		}
+	}
+	else
+	{
+		# not wide characters, print out previous range if any
+		if (defined($range_start))
+		{
+			push @ranges, {first => $range_start, last => $prev_last, width => 2};
+			$range_start = undef;
+		}
+	}
+}
+continue
+{
+	$prev_last = $last;
+}
+
+# don't forget any ranges at the very end of the database (though there are none
+# as of Unicode 13.0)
+if (defined($range_start))
+{
+	push @ranges, {first => $range_start, last => $prev_last, width => 2};
+}
+
+close $UD;
+close $EAW;
+
+# emit the sorted ranges with their widths
+print "static const struct mbinterval wcwidth[] = {\n";
+
+foreach my $range (sort {$a->{first} <=> $b->{first}} @ranges)
+{
+	printf "\t{0x%04X, 0x%04X, %d},\n", $range->{first}, $range->{last}, $range->{width};
+}
+
 print "};\n";
diff --git a/src/common/wchar.c b/src/common/wchar.c
index c0397ca139..e7c3f5dd09 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -583,9 +583,9 @@ pg_utf_mblen(const unsigned char *s)
 
 struct mbinterval
 {
-	unsigned short first;
-	unsigned short last;
-	signed short width;
+	unsigned int first;
+	unsigned int last:21;
+	signed int	width:4;
 };
 
 /* auxiliary function for binary search in interval table */
@@ -662,22 +662,7 @@ ucs_wcwidth(pg_wchar ucs)
 	if (range != NULL)
 		return range->width;
 
-	/*
-	 * if we arrive here, ucs is not a combining or C0/C1 control character
-	 */
-
-	return 1 +
-		(ucs >= 0x1100 &&
-		 (ucs <= 0x115f ||		/* Hangul Jamo init. consonants */
-		  (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
-		   ucs != 0x303f) ||	/* CJK ... Yi */
-		  (ucs >= 0xac00 && ucs <= 0xd7a3) ||	/* Hangul Syllables */
-		  (ucs >= 0xf900 && ucs <= 0xfaff) ||	/* CJK Compatibility
-												 * Ideographs */
-		  (ucs >= 0xfe30 && ucs <= 0xfe6f) ||	/* CJK Compatibility Forms */
-		  (ucs >= 0xff00 && ucs <= 0xff5f) ||	/* Fullwidth Forms */
-		  (ucs >= 0xffe0 && ucs <= 0xffe6) ||
-		  (ucs >= 0x20000 && ucs <= 0x2ffff)));
+	return 1;
 }
 
 /*
diff --git a/src/include/common/unicode_width_table.h b/src/include/common/unicode_width_table.h
index 3b161f47a4..5b7b6e15e2 100644
--- a/src/include/common/unicode_width_table.h
+++ b/src/include/common/unicode_width_table.h
@@ -102,6 +102,7 @@ static const struct mbinterval wcwidth[] = {
 	{0x1085, 0x1086, 0},
 	{0x108D, 0x108D, 0},
 	{0x109D, 0x109D, 0},
+	{0x1100, 0x115F, 2},
 	{0x135D, 0x135F, 0},
 	{0x1712, 0x1714, 0},
 	{0x1732, 0x1734, 0},
@@ -150,11 +151,60 @@ static const struct mbinterval wcwidth[] = {
 	{0x1CF8, 0x1CF9, 0},
 	{0x1DC0, 0x1DFF, 0},
 	{0x20D0, 0x20F0, 0},
+	{0x231A, 0x231B, 2},
+	{0x2329, 0x232A, 2},
+	{0x23E9, 0x23EC, 2},
+	{0x23F0, 0x23F0, 2},
+	{0x23F3, 0x23F3, 2},
+	{0x25FD, 0x25FE, 2},
+	{0x2614, 0x2615, 2},
+	{0x2648, 0x2653, 2},
+	{0x267F, 0x267F, 2},
+	{0x2693, 0x2693, 2},
+	{0x26A1, 0x26A1, 2},
+	{0x26AA, 0x26AB, 2},
+	{0x26BD, 0x26BE, 2},
+	{0x26C4, 0x26C5, 2},
+	{0x26CE, 0x26CE, 2},
+	{0x26D4, 0x26D4, 2},
+	{0x26EA, 0x26EA, 2},
+	{0x26F2, 0x26F3, 2},
+	{0x26F5, 0x26F5, 2},
+	{0x26FA, 0x26FA, 2},
+	{0x26FD, 0x26FD, 2},
+	{0x2705, 0x2705, 2},
+	{0x270A, 0x270B, 2},
+	{0x2728, 0x2728, 2},
+	{0x274C, 0x274C, 2},
+	{0x274E, 0x274E, 2},
+	{0x2753, 0x2755, 2},
+	{0x2757, 0x2757, 2},
+	{0x2795, 0x2797, 2},
+	{0x27B0, 0x27B0, 2},
+	{0x27BF, 0x27BF, 2},
+	{0x2B1B, 0x2B1C, 2},
+	{0x2B50, 0x2B50, 2},
+	{0x2B55, 0x2B55, 2},
 	{0x2CEF, 0x2CF1, 0},
 	{0x2D7F, 0x2D7F, 0},
 	{0x2DE0, 0x2DFF, 0},
+	{0x2E80, 0x2E99, 2},
+	{0x2E9B, 0x2EF3, 2},
+	{0x2F00, 0x2FD5, 2},
+	{0x2FF0, 0x2FFB, 2},
+	{0x3000, 0x303E, 2},
 	{0x302A, 0x302D, 0},
+	{0x3041, 0x3096, 2},
 	{0x3099, 0x309A, 0},
+	{0x3099, 0x30FF, 2},
+	{0x3105, 0x312F, 2},
+	{0x3131, 0x318E, 2},
+	{0x3190, 0x31E3, 2},
+	{0x31F0, 0x321E, 2},
+	{0x3220, 0x3247, 2},
+	{0x3250, 0x4DBF, 2},
+	{0x4E00, 0xA48C, 2},
+	{0xA490, 0xA4C6, 2},
 	{0xA66F, 0xA672, 0},
 	{0xA674, 0xA67D, 0},
 	{0xA69E, 0xA69F, 0},
@@ -169,6 +219,7 @@ static const struct mbinterval wcwidth[] = {
 	{0xA8FF, 0xA8FF, 0},
 	{0xA926, 0xA92D, 0},
 	{0xA947, 0xA951, 0},
+	{0xA960, 0xA97C, 2},
 	{0xA980, 0xA982, 0},
 	{0xA9B3, 0xA9B3, 0},
 	{0xA9B6, 0xA9B9, 0},
@@ -190,7 +241,72 @@ static const struct mbinterval wcwidth[] = {
 	{0xABE5, 0xABE5, 0},
 	{0xABE8, 0xABE8, 0},
 	{0xABED, 0xABED, 0},
+	{0xAC00, 0xD7A3, 2},
+	{0xF900, 0xFAFF, 2},
 	{0xFB1E, 0xFB1E, 0},
 	{0xFE00, 0xFE0F, 0},
+	{0xFE10, 0xFE19, 2},
 	{0xFE20, 0xFE2F, 0},
+	{0xFE30, 0xFE52, 2},
+	{0xFE54, 0xFE66, 2},
+	{0xFE68, 0xFE6B, 2},
+	{0xFF01, 0xFF60, 2},
+	{0xFFE0, 0xFFE6, 2},
+	{0x16FE0, 0x16FE4, 2},
+	{0x16FF0, 0x16FF1, 2},
+	{0x17000, 0x187F7, 2},
+	{0x18800, 0x18CD5, 2},
+	{0x18D00, 0x18D08, 2},
+	{0x1B000, 0x1B11E, 2},
+	{0x1B150, 0x1B152, 2},
+	{0x1B164, 0x1B167, 2},
+	{0x1B170, 0x1B2FB, 2},
+	{0x1F004, 0x1F004, 2},
+	{0x1F0CF, 0x1F0CF, 2},
+	{0x1F18E, 0x1F18E, 2},
+	{0x1F191, 0x1F19A, 2},
+	{0x1F200, 0x1F202, 2},
+	{0x1F210, 0x1F23B, 2},
+	{0x1F240, 0x1F248, 2},
+	{0x1F250, 0x1F251, 2},
+	{0x1F260, 0x1F265, 2},
+	{0x1F300, 0x1F320, 2},
+	{0x1F32D, 0x1F335, 2},
+	{0x1F337, 0x1F37C, 2},
+	{0x1F37E, 0x1F393, 2},
+	{0x1F3A0, 0x1F3CA, 2},
+	{0x1F3CF, 0x1F3D3, 2},
+	{0x1F3E0, 0x1F3F0, 2},
+	{0x1F3F4, 0x1F3F4, 2},
+	{0x1F3F8, 0x1F43E, 2},
+	{0x1F440, 0x1F440, 2},
+	{0x1F442, 0x1F4FC, 2},
+	{0x1F4FF, 0x1F53D, 2},
+	{0x1F54B, 0x1F54E, 2},
+	{0x1F550, 0x1F567, 2},
+	{0x1F57A, 0x1F57A, 2},
+	{0x1F595, 0x1F596, 2},
+	{0x1F5A4, 0x1F5A4, 2},
+	{0x1F5FB, 0x1F64F, 2},
+	{0x1F680, 0x1F6C5, 2},
+	{0x1F6CC, 0x1F6CC, 2},
+	{0x1F6D0, 0x1F6D2, 2},
+	{0x1F6D5, 0x1F6D7, 2},
+	{0x1F6EB, 0x1F6EC, 2},
+	{0x1F6F4, 0x1F6FC, 2},
+	{0x1F7E0, 0x1F7EB, 2},
+	{0x1F90C, 0x1F93A, 2},
+	{0x1F93C, 0x1F945, 2},
+	{0x1F947, 0x1F978, 2},
+	{0x1F97A, 0x1F9CB, 2},
+	{0x1F9CD, 0x1F9FF, 2},
+	{0x1FA70, 0x1FA74, 2},
+	{0x1FA78, 0x1FA7A, 2},
+	{0x1FA80, 0x1FA86, 2},
+	{0x1FA90, 0x1FAA8, 2},
+	{0x1FAB0, 0x1FAB6, 2},
+	{0x1FAC0, 0x1FAC2, 2},
+	{0x1FAD0, 0x1FAD6, 2},
+	{0x20000, 0x2FFFD, 2},
+	{0x30000, 0x3FFFD, 2},
 };
-- 
2.31.1

v2-0002-Allow-callers-mbbisearch-to-get-an-explicit-chara.patchapplication/octet-stream; name=v2-0002-Allow-callers-mbbisearch-to-get-an-explicit-chara.patchDownload
From 09ac998fdf6d169a7f7a4d053038e2f48205052f Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Aug 2021 11:10:17 -0400
Subject: [PATCH v2 2/4] Allow callers of mbbisearch to get an explicit character
 width

Add width field to mbinterval and have mbbisearch return a
pointer to a range rather than just bool for success. A future
commit will add other widths besides zero, and this will allow
it to use the same search.
---
 .../unicode/generate-unicode_width_table.pl   |   6 +-
 src/common/wchar.c                            |  20 +-
 src/include/common/unicode_width_table.h      | 388 +++++++++---------
 3 files changed, 209 insertions(+), 205 deletions(-)

diff --git a/src/common/unicode/generate-unicode_width_table.pl b/src/common/unicode/generate-unicode_width_table.pl
index 86aed78907..0cf44b029c 100644
--- a/src/common/unicode/generate-unicode_width_table.pl
+++ b/src/common/unicode/generate-unicode_width_table.pl
@@ -15,9 +15,9 @@ my $prev_codepoint;
 my $count = 0;
 
 print
-  "/* generated by src/common/unicode/generate-unicode_combining_table.pl, do not edit */\n\n";
+  "/* generated by src/common/unicode/generate-unicode_width_table.pl, do not edit */\n\n";
 
-print "static const struct mbinterval combining[] = {\n";
+print "static const struct mbinterval wcwidth[] = {\n";
 
 foreach my $line (<ARGV>)
 {
@@ -40,7 +40,7 @@ foreach my $line (<ARGV>)
 		# not a combining character, print out previous range if any
 		if (defined($range_start))
 		{
-			printf "\t{0x%04X, 0x%04X},\n", $range_start, $prev_codepoint;
+			printf "\t{0x%04X, 0x%04X, 0},\n", $range_start, $prev_codepoint;
 			$range_start = undef;
 		}
 	}
diff --git a/src/common/wchar.c b/src/common/wchar.c
index bb97b5f54f..c0397ca139 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -585,17 +585,18 @@ struct mbinterval
 {
 	unsigned short first;
 	unsigned short last;
+	signed short width;
 };
 
 /* auxiliary function for binary search in interval table */
-static int
+static const struct mbinterval *
 mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
 {
 	int			min = 0;
 	int			mid;
 
 	if (ucs < table[0].first || ucs > table[max].last)
-		return 0;
+		return NULL;
 	while (max >= min)
 	{
 		mid = (min + max) / 2;
@@ -604,10 +605,10 @@ mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
 		else if (ucs < table[mid].first)
 			max = mid - 1;
 		else
-			return 1;
+			return &table[mid];
 	}
 
-	return 0;
+	return NULL;
 }
 
 
@@ -653,10 +654,13 @@ ucs_wcwidth(pg_wchar ucs)
 	if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff)
 		return -1;
 
-	/* binary search in table of non-spacing characters */
-	if (mbbisearch(ucs, combining,
-				   sizeof(combining) / sizeof(struct mbinterval) - 1))
-		return 0;
+	/* binary search in table of character widths */
+	const struct mbinterval *range =
+	mbbisearch(ucs, wcwidth,
+			   sizeof(wcwidth) / sizeof(struct mbinterval) - 1);
+
+	if (range != NULL)
+		return range->width;
 
 	/*
 	 * if we arrive here, ucs is not a combining or C0/C1 control character
diff --git a/src/include/common/unicode_width_table.h b/src/include/common/unicode_width_table.h
index a9f10c31bc..3b161f47a4 100644
--- a/src/include/common/unicode_width_table.h
+++ b/src/include/common/unicode_width_table.h
@@ -1,196 +1,196 @@
-/* generated by src/common/unicode/generate-unicode_combining_table.pl, do not edit */
+/* generated by src/common/unicode/generate-unicode_width_table.pl, do not edit */
 
-static const struct mbinterval combining[] = {
-	{0x0300, 0x036F},
-	{0x0483, 0x0489},
-	{0x0591, 0x05BD},
-	{0x05BF, 0x05BF},
-	{0x05C1, 0x05C2},
-	{0x05C4, 0x05C5},
-	{0x05C7, 0x05C7},
-	{0x0610, 0x061A},
-	{0x064B, 0x065F},
-	{0x0670, 0x0670},
-	{0x06D6, 0x06DC},
-	{0x06DF, 0x06E4},
-	{0x06E7, 0x06E8},
-	{0x06EA, 0x06ED},
-	{0x0711, 0x0711},
-	{0x0730, 0x074A},
-	{0x07A6, 0x07B0},
-	{0x07EB, 0x07F3},
-	{0x07FD, 0x07FD},
-	{0x0816, 0x0819},
-	{0x081B, 0x0823},
-	{0x0825, 0x0827},
-	{0x0829, 0x082D},
-	{0x0859, 0x085B},
-	{0x08D3, 0x08E1},
-	{0x08E3, 0x0902},
-	{0x093A, 0x093A},
-	{0x093C, 0x093C},
-	{0x0941, 0x0948},
-	{0x094D, 0x094D},
-	{0x0951, 0x0957},
-	{0x0962, 0x0963},
-	{0x0981, 0x0981},
-	{0x09BC, 0x09BC},
-	{0x09C1, 0x09C4},
-	{0x09CD, 0x09CD},
-	{0x09E2, 0x09E3},
-	{0x09FE, 0x0A02},
-	{0x0A3C, 0x0A3C},
-	{0x0A41, 0x0A51},
-	{0x0A70, 0x0A71},
-	{0x0A75, 0x0A75},
-	{0x0A81, 0x0A82},
-	{0x0ABC, 0x0ABC},
-	{0x0AC1, 0x0AC8},
-	{0x0ACD, 0x0ACD},
-	{0x0AE2, 0x0AE3},
-	{0x0AFA, 0x0B01},
-	{0x0B3C, 0x0B3C},
-	{0x0B3F, 0x0B3F},
-	{0x0B41, 0x0B44},
-	{0x0B4D, 0x0B56},
-	{0x0B62, 0x0B63},
-	{0x0B82, 0x0B82},
-	{0x0BC0, 0x0BC0},
-	{0x0BCD, 0x0BCD},
-	{0x0C00, 0x0C00},
-	{0x0C04, 0x0C04},
-	{0x0C3E, 0x0C40},
-	{0x0C46, 0x0C56},
-	{0x0C62, 0x0C63},
-	{0x0C81, 0x0C81},
-	{0x0CBC, 0x0CBC},
-	{0x0CBF, 0x0CBF},
-	{0x0CC6, 0x0CC6},
-	{0x0CCC, 0x0CCD},
-	{0x0CE2, 0x0CE3},
-	{0x0D00, 0x0D01},
-	{0x0D3B, 0x0D3C},
-	{0x0D41, 0x0D44},
-	{0x0D4D, 0x0D4D},
-	{0x0D62, 0x0D63},
-	{0x0D81, 0x0D81},
-	{0x0DCA, 0x0DCA},
-	{0x0DD2, 0x0DD6},
-	{0x0E31, 0x0E31},
-	{0x0E34, 0x0E3A},
-	{0x0E47, 0x0E4E},
-	{0x0EB1, 0x0EB1},
-	{0x0EB4, 0x0EBC},
-	{0x0EC8, 0x0ECD},
-	{0x0F18, 0x0F19},
-	{0x0F35, 0x0F35},
-	{0x0F37, 0x0F37},
-	{0x0F39, 0x0F39},
-	{0x0F71, 0x0F7E},
-	{0x0F80, 0x0F84},
-	{0x0F86, 0x0F87},
-	{0x0F8D, 0x0FBC},
-	{0x0FC6, 0x0FC6},
-	{0x102D, 0x1030},
-	{0x1032, 0x1037},
-	{0x1039, 0x103A},
-	{0x103D, 0x103E},
-	{0x1058, 0x1059},
-	{0x105E, 0x1060},
-	{0x1071, 0x1074},
-	{0x1082, 0x1082},
-	{0x1085, 0x1086},
-	{0x108D, 0x108D},
-	{0x109D, 0x109D},
-	{0x135D, 0x135F},
-	{0x1712, 0x1714},
-	{0x1732, 0x1734},
-	{0x1752, 0x1753},
-	{0x1772, 0x1773},
-	{0x17B4, 0x17B5},
-	{0x17B7, 0x17BD},
-	{0x17C6, 0x17C6},
-	{0x17C9, 0x17D3},
-	{0x17DD, 0x17DD},
-	{0x180B, 0x180D},
-	{0x1885, 0x1886},
-	{0x18A9, 0x18A9},
-	{0x1920, 0x1922},
-	{0x1927, 0x1928},
-	{0x1932, 0x1932},
-	{0x1939, 0x193B},
-	{0x1A17, 0x1A18},
-	{0x1A1B, 0x1A1B},
-	{0x1A56, 0x1A56},
-	{0x1A58, 0x1A60},
-	{0x1A62, 0x1A62},
-	{0x1A65, 0x1A6C},
-	{0x1A73, 0x1A7F},
-	{0x1AB0, 0x1B03},
-	{0x1B34, 0x1B34},
-	{0x1B36, 0x1B3A},
-	{0x1B3C, 0x1B3C},
-	{0x1B42, 0x1B42},
-	{0x1B6B, 0x1B73},
-	{0x1B80, 0x1B81},
-	{0x1BA2, 0x1BA5},
-	{0x1BA8, 0x1BA9},
-	{0x1BAB, 0x1BAD},
-	{0x1BE6, 0x1BE6},
-	{0x1BE8, 0x1BE9},
-	{0x1BED, 0x1BED},
-	{0x1BEF, 0x1BF1},
-	{0x1C2C, 0x1C33},
-	{0x1C36, 0x1C37},
-	{0x1CD0, 0x1CD2},
-	{0x1CD4, 0x1CE0},
-	{0x1CE2, 0x1CE8},
-	{0x1CED, 0x1CED},
-	{0x1CF4, 0x1CF4},
-	{0x1CF8, 0x1CF9},
-	{0x1DC0, 0x1DFF},
-	{0x20D0, 0x20F0},
-	{0x2CEF, 0x2CF1},
-	{0x2D7F, 0x2D7F},
-	{0x2DE0, 0x2DFF},
-	{0x302A, 0x302D},
-	{0x3099, 0x309A},
-	{0xA66F, 0xA672},
-	{0xA674, 0xA67D},
-	{0xA69E, 0xA69F},
-	{0xA6F0, 0xA6F1},
-	{0xA802, 0xA802},
-	{0xA806, 0xA806},
-	{0xA80B, 0xA80B},
-	{0xA825, 0xA826},
-	{0xA82C, 0xA82C},
-	{0xA8C4, 0xA8C5},
-	{0xA8E0, 0xA8F1},
-	{0xA8FF, 0xA8FF},
-	{0xA926, 0xA92D},
-	{0xA947, 0xA951},
-	{0xA980, 0xA982},
-	{0xA9B3, 0xA9B3},
-	{0xA9B6, 0xA9B9},
-	{0xA9BC, 0xA9BD},
-	{0xA9E5, 0xA9E5},
-	{0xAA29, 0xAA2E},
-	{0xAA31, 0xAA32},
-	{0xAA35, 0xAA36},
-	{0xAA43, 0xAA43},
-	{0xAA4C, 0xAA4C},
-	{0xAA7C, 0xAA7C},
-	{0xAAB0, 0xAAB0},
-	{0xAAB2, 0xAAB4},
-	{0xAAB7, 0xAAB8},
-	{0xAABE, 0xAABF},
-	{0xAAC1, 0xAAC1},
-	{0xAAEC, 0xAAED},
-	{0xAAF6, 0xAAF6},
-	{0xABE5, 0xABE5},
-	{0xABE8, 0xABE8},
-	{0xABED, 0xABED},
-	{0xFB1E, 0xFB1E},
-	{0xFE00, 0xFE0F},
-	{0xFE20, 0xFE2F},
+static const struct mbinterval wcwidth[] = {
+	{0x0300, 0x036F, 0},
+	{0x0483, 0x0489, 0},
+	{0x0591, 0x05BD, 0},
+	{0x05BF, 0x05BF, 0},
+	{0x05C1, 0x05C2, 0},
+	{0x05C4, 0x05C5, 0},
+	{0x05C7, 0x05C7, 0},
+	{0x0610, 0x061A, 0},
+	{0x064B, 0x065F, 0},
+	{0x0670, 0x0670, 0},
+	{0x06D6, 0x06DC, 0},
+	{0x06DF, 0x06E4, 0},
+	{0x06E7, 0x06E8, 0},
+	{0x06EA, 0x06ED, 0},
+	{0x0711, 0x0711, 0},
+	{0x0730, 0x074A, 0},
+	{0x07A6, 0x07B0, 0},
+	{0x07EB, 0x07F3, 0},
+	{0x07FD, 0x07FD, 0},
+	{0x0816, 0x0819, 0},
+	{0x081B, 0x0823, 0},
+	{0x0825, 0x0827, 0},
+	{0x0829, 0x082D, 0},
+	{0x0859, 0x085B, 0},
+	{0x08D3, 0x08E1, 0},
+	{0x08E3, 0x0902, 0},
+	{0x093A, 0x093A, 0},
+	{0x093C, 0x093C, 0},
+	{0x0941, 0x0948, 0},
+	{0x094D, 0x094D, 0},
+	{0x0951, 0x0957, 0},
+	{0x0962, 0x0963, 0},
+	{0x0981, 0x0981, 0},
+	{0x09BC, 0x09BC, 0},
+	{0x09C1, 0x09C4, 0},
+	{0x09CD, 0x09CD, 0},
+	{0x09E2, 0x09E3, 0},
+	{0x09FE, 0x0A02, 0},
+	{0x0A3C, 0x0A3C, 0},
+	{0x0A41, 0x0A51, 0},
+	{0x0A70, 0x0A71, 0},
+	{0x0A75, 0x0A75, 0},
+	{0x0A81, 0x0A82, 0},
+	{0x0ABC, 0x0ABC, 0},
+	{0x0AC1, 0x0AC8, 0},
+	{0x0ACD, 0x0ACD, 0},
+	{0x0AE2, 0x0AE3, 0},
+	{0x0AFA, 0x0B01, 0},
+	{0x0B3C, 0x0B3C, 0},
+	{0x0B3F, 0x0B3F, 0},
+	{0x0B41, 0x0B44, 0},
+	{0x0B4D, 0x0B56, 0},
+	{0x0B62, 0x0B63, 0},
+	{0x0B82, 0x0B82, 0},
+	{0x0BC0, 0x0BC0, 0},
+	{0x0BCD, 0x0BCD, 0},
+	{0x0C00, 0x0C00, 0},
+	{0x0C04, 0x0C04, 0},
+	{0x0C3E, 0x0C40, 0},
+	{0x0C46, 0x0C56, 0},
+	{0x0C62, 0x0C63, 0},
+	{0x0C81, 0x0C81, 0},
+	{0x0CBC, 0x0CBC, 0},
+	{0x0CBF, 0x0CBF, 0},
+	{0x0CC6, 0x0CC6, 0},
+	{0x0CCC, 0x0CCD, 0},
+	{0x0CE2, 0x0CE3, 0},
+	{0x0D00, 0x0D01, 0},
+	{0x0D3B, 0x0D3C, 0},
+	{0x0D41, 0x0D44, 0},
+	{0x0D4D, 0x0D4D, 0},
+	{0x0D62, 0x0D63, 0},
+	{0x0D81, 0x0D81, 0},
+	{0x0DCA, 0x0DCA, 0},
+	{0x0DD2, 0x0DD6, 0},
+	{0x0E31, 0x0E31, 0},
+	{0x0E34, 0x0E3A, 0},
+	{0x0E47, 0x0E4E, 0},
+	{0x0EB1, 0x0EB1, 0},
+	{0x0EB4, 0x0EBC, 0},
+	{0x0EC8, 0x0ECD, 0},
+	{0x0F18, 0x0F19, 0},
+	{0x0F35, 0x0F35, 0},
+	{0x0F37, 0x0F37, 0},
+	{0x0F39, 0x0F39, 0},
+	{0x0F71, 0x0F7E, 0},
+	{0x0F80, 0x0F84, 0},
+	{0x0F86, 0x0F87, 0},
+	{0x0F8D, 0x0FBC, 0},
+	{0x0FC6, 0x0FC6, 0},
+	{0x102D, 0x1030, 0},
+	{0x1032, 0x1037, 0},
+	{0x1039, 0x103A, 0},
+	{0x103D, 0x103E, 0},
+	{0x1058, 0x1059, 0},
+	{0x105E, 0x1060, 0},
+	{0x1071, 0x1074, 0},
+	{0x1082, 0x1082, 0},
+	{0x1085, 0x1086, 0},
+	{0x108D, 0x108D, 0},
+	{0x109D, 0x109D, 0},
+	{0x135D, 0x135F, 0},
+	{0x1712, 0x1714, 0},
+	{0x1732, 0x1734, 0},
+	{0x1752, 0x1753, 0},
+	{0x1772, 0x1773, 0},
+	{0x17B4, 0x17B5, 0},
+	{0x17B7, 0x17BD, 0},
+	{0x17C6, 0x17C6, 0},
+	{0x17C9, 0x17D3, 0},
+	{0x17DD, 0x17DD, 0},
+	{0x180B, 0x180D, 0},
+	{0x1885, 0x1886, 0},
+	{0x18A9, 0x18A9, 0},
+	{0x1920, 0x1922, 0},
+	{0x1927, 0x1928, 0},
+	{0x1932, 0x1932, 0},
+	{0x1939, 0x193B, 0},
+	{0x1A17, 0x1A18, 0},
+	{0x1A1B, 0x1A1B, 0},
+	{0x1A56, 0x1A56, 0},
+	{0x1A58, 0x1A60, 0},
+	{0x1A62, 0x1A62, 0},
+	{0x1A65, 0x1A6C, 0},
+	{0x1A73, 0x1A7F, 0},
+	{0x1AB0, 0x1B03, 0},
+	{0x1B34, 0x1B34, 0},
+	{0x1B36, 0x1B3A, 0},
+	{0x1B3C, 0x1B3C, 0},
+	{0x1B42, 0x1B42, 0},
+	{0x1B6B, 0x1B73, 0},
+	{0x1B80, 0x1B81, 0},
+	{0x1BA2, 0x1BA5, 0},
+	{0x1BA8, 0x1BA9, 0},
+	{0x1BAB, 0x1BAD, 0},
+	{0x1BE6, 0x1BE6, 0},
+	{0x1BE8, 0x1BE9, 0},
+	{0x1BED, 0x1BED, 0},
+	{0x1BEF, 0x1BF1, 0},
+	{0x1C2C, 0x1C33, 0},
+	{0x1C36, 0x1C37, 0},
+	{0x1CD0, 0x1CD2, 0},
+	{0x1CD4, 0x1CE0, 0},
+	{0x1CE2, 0x1CE8, 0},
+	{0x1CED, 0x1CED, 0},
+	{0x1CF4, 0x1CF4, 0},
+	{0x1CF8, 0x1CF9, 0},
+	{0x1DC0, 0x1DFF, 0},
+	{0x20D0, 0x20F0, 0},
+	{0x2CEF, 0x2CF1, 0},
+	{0x2D7F, 0x2D7F, 0},
+	{0x2DE0, 0x2DFF, 0},
+	{0x302A, 0x302D, 0},
+	{0x3099, 0x309A, 0},
+	{0xA66F, 0xA672, 0},
+	{0xA674, 0xA67D, 0},
+	{0xA69E, 0xA69F, 0},
+	{0xA6F0, 0xA6F1, 0},
+	{0xA802, 0xA802, 0},
+	{0xA806, 0xA806, 0},
+	{0xA80B, 0xA80B, 0},
+	{0xA825, 0xA826, 0},
+	{0xA82C, 0xA82C, 0},
+	{0xA8C4, 0xA8C5, 0},
+	{0xA8E0, 0xA8F1, 0},
+	{0xA8FF, 0xA8FF, 0},
+	{0xA926, 0xA92D, 0},
+	{0xA947, 0xA951, 0},
+	{0xA980, 0xA982, 0},
+	{0xA9B3, 0xA9B3, 0},
+	{0xA9B6, 0xA9B9, 0},
+	{0xA9BC, 0xA9BD, 0},
+	{0xA9E5, 0xA9E5, 0},
+	{0xAA29, 0xAA2E, 0},
+	{0xAA31, 0xAA32, 0},
+	{0xAA35, 0xAA36, 0},
+	{0xAA43, 0xAA43, 0},
+	{0xAA4C, 0xAA4C, 0},
+	{0xAA7C, 0xAA7C, 0},
+	{0xAAB0, 0xAAB0, 0},
+	{0xAAB2, 0xAAB4, 0},
+	{0xAAB7, 0xAAB8, 0},
+	{0xAABE, 0xAABF, 0},
+	{0xAAC1, 0xAAC1, 0},
+	{0xAAEC, 0xAAED, 0},
+	{0xAAF6, 0xAAF6, 0},
+	{0xABE5, 0xABE5, 0},
+	{0xABE8, 0xABE8, 0},
+	{0xABED, 0xABED, 0},
+	{0xFB1E, 0xFB1E, 0},
+	{0xFE00, 0xFE0F, 0},
+	{0xFE20, 0xFE2F, 0},
 };
-- 
2.31.1

#31John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#30)
Re: badly calculated width of emoji in psql

I plan to commit my proposed v2 this week unless I hear reservations about
my adjustments, or bikeshedding. I'm thinking of squashing 0001 and 0002.

--
John Naylor
EDB: http://www.enterprisedb.com

#32Jacob Champion
pchampion@vmware.com
In reply to: John Naylor (#30)
Re: badly calculated width of emoji in psql

On Fri, 2021-08-20 at 13:05 -0400, John Naylor wrote:

On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion <pchampion@vmware.com> wrote:

I guess it just depends on what the end result looks/performs like.
We'd save seven hops or so in the worst case?

Something like that. Attached is what I had in mind (using real
patches to see what the CF bot thinks):

0001 is a simple renaming
0002 puts the char width inside the mbinterval so we can put arbitrary values there

0002 introduces a mixed declarations/statements warning for
ucs_wcwidth(). Other than that, LGTM overall.

--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -583,9 +583,9 @@ pg_utf_mblen(const unsigned char *s)
struct mbinterval
{
-   unsigned short first;
-   unsigned short last;
-   signed short width;
+   unsigned int first;
+   unsigned int last:21;
+   signed int  width:4;
};

Oh, right -- my patch moved mbinterval from short to int, but should I
have used uint32 instead? It would only matter in theory for the
`first` member now that the bitfields are there.

I think the adjustments to 0003 result in a cleaner and more
extensible design, but a case could be made otherwise. The former
combining table script is a bit more complex than the sum of its
former self and Jacob's proposed new script, but just slightly.

The microbenchmark says it's also more performant, so +1 to your
version.

Does there need to be any sanity check for overlapping ranges between
the combining and fullwidth sets? The Unicode data on a dev's machine
would have to be broken somehow for that to happen, but it could
potentially go undetected for a while if it did.

Also, I checked the behavior of this comment that I proposed to remove upthread:

- * - Other format characters (general category code Cf in the Unicode
- * database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.

We don't handle the latter in our current setup:

SELECT U&'foo\200Bbar';
+----------+
| ?column? |
+----------+
| foobar |
+----------+
(1 row)

Not sure if we should do anything about this. It was an explicit
exception years ago in our vendored manual table, but is not labeled
as such in the official Unicode files.

I'm wary of changing too many things at once, but it does seem like we
should be giving that codepoint a width of 0.

On Tue, 2021-08-24 at 12:05 -0400, John Naylor wrote:

I plan to commit my proposed v2 this week unless I hear reservations
about my adjustments, or bikeshedding. I'm thinking of squashing 0001
and 0002.

+1

Thanks!
--Jacob

#33John Naylor
john.naylor@enterprisedb.com
In reply to: Jacob Champion (#32)
Re: badly calculated width of emoji in psql

On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion <pchampion@vmware.com> wrote:

On Fri, 2021-08-20 at 13:05 -0400, John Naylor wrote:

On Thu, Aug 19, 2021 at 8:05 PM Jacob Champion <pchampion@vmware.com>

wrote:

I guess it just depends on what the end result looks/performs like.
We'd save seven hops or so in the worst case?

Something like that. Attached is what I had in mind (using real
patches to see what the CF bot thinks):

0001 is a simple renaming
0002 puts the char width inside the mbinterval so we can put arbitrary

values there

0002 introduces a mixed declarations/statements warning for
ucs_wcwidth(). Other than that, LGTM overall.

I didn't see that warning with clang 12, either with or without assertions,
but I do see it on gcc 11. Fixed, and pushed 0001 and 0002. I decided
against squashing them, since my original instinct was correct -- the
header changes too much for git to consider it the same file, which may
make archeology harder.

--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -583,9 +583,9 @@ pg_utf_mblen(const unsigned char *s)
struct mbinterval
{
-   unsigned short first;
-   unsigned short last;
-   signed short width;
+   unsigned int first;
+   unsigned int last:21;
+   signed int  width:4;
};

Oh, right -- my patch moved mbinterval from short to int, but should I
have used uint32 instead? It would only matter in theory for the
`first` member now that the bitfields are there.

I'm not sure it would matter, but the usual type for codepoints is unsigned.

I think the adjustments to 0003 result in a cleaner and more
extensible design, but a case could be made otherwise. The former
combining table script is a bit more complex than the sum of its
former self and Jacob's proposed new script, but just slightly.

The microbenchmark says it's also more performant, so +1 to your
version.

Does there need to be any sanity check for overlapping ranges between
the combining and fullwidth sets? The Unicode data on a dev's machine
would have to be broken somehow for that to happen, but it could
potentially go undetected for a while if it did.

Thanks for testing again! The sanity check sounds like a good idea, so I'll
work on that and push soon.

--
John Naylor
EDB: http://www.enterprisedb.com

#34John Naylor
john.naylor@enterprisedb.com
In reply to: Jacob Champion (#32)
Re: badly calculated width of emoji in psql

On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion <pchampion@vmware.com> wrote:

Does there need to be any sanity check for overlapping ranges between
the combining and fullwidth sets? The Unicode data on a dev's machine
would have to be broken somehow for that to happen, but it could
potentially go undetected for a while if it did.

It turns out I should have done that to begin with. In the Unicode data, it
apparently happens that a character can be both combining and wide, and
that will cause ranges to overlap in my scheme:

302A..302D;W # Mn [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC
ENTERING TONE MARK

{0x3000, 0x303E, 2},
{0x302A, 0x302D, 0},

3099..309A;W # Mn [2] COMBINING KATAKANA-HIRAGANA VOICED SOUND
MARK..COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

{0x3099, 0x309A, 0},
{0x3099, 0x30FF, 2},

Going by the above, Jacob's patch from July 21 just happened to be correct
by chance since the combining character search happened first.

It seems the logical thing to do is revert my 0001 and 0002 and go back to
something much closer to Jacob's patch, plus a big comment explaining that
the order in which the searches happen matters.

The EastAsianWidth.txt does have combining property "Mn" in the comment
above, so it's tempting to just read that (plus we could read just one file
for these properties). However, it seems risky to rely on comments, since
their presence and format is probably less stable than the data format.
--
John Naylor
EDB: http://www.enterprisedb.com

#35John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#34)
Re: badly calculated width of emoji in psql

I wrote:

On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion <pchampion@vmware.com>

wrote:

It seems the logical thing to do is revert my 0001 and 0002 and go back

to something much closer to Jacob's patch, plus a big comment explaining
that the order in which the searches happen matters.

I pushed Jacob's patch with the addendum I shared upthread, plus a comment
about search order. I also took the liberty of changing the author in the
CF app to Jacob. Later I'll push detecting non-spacing characters beyond
the BMP.
--
John Naylor
EDB: http://www.enterprisedb.com

#36Jacob Champion
pchampion@vmware.com
In reply to: John Naylor (#34)
Re: badly calculated width of emoji in psql

On Wed, 2021-08-25 at 16:15 -0400, John Naylor wrote:

On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion <pchampion@vmware.com> wrote:

Does there need to be any sanity check for overlapping ranges between
the combining and fullwidth sets? The Unicode data on a dev's machine
would have to be broken somehow for that to happen, but it could
potentially go undetected for a while if it did.

It turns out I should have done that to begin with. In the Unicode
data, it apparently happens that a character can be both combining
and wide, and that will cause ranges to overlap in my scheme:

I was looking for overlaps in my review, but I skipped right over that,
sorry...

On Thu, 2021-08-26 at 11:12 -0400, John Naylor wrote:

I pushed Jacob's patch with the addendum I shared upthread, plus a
comment about search order. I also took the liberty of changing the
author in the CF app to Jacob. Later I'll push detecting non-spacing
characters beyond the BMP.

Thanks!

--Jacob

#37Pavel Stehule
pavel.stehule@gmail.com
In reply to: Jacob Champion (#36)
Re: badly calculated width of emoji in psql

čt 26. 8. 2021 v 17:25 odesílatel Jacob Champion <pchampion@vmware.com>
napsal:

On Wed, 2021-08-25 at 16:15 -0400, John Naylor wrote:

On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion <pchampion@vmware.com>

wrote:

Does there need to be any sanity check for overlapping ranges between
the combining and fullwidth sets? The Unicode data on a dev's machine
would have to be broken somehow for that to happen, but it could
potentially go undetected for a while if it did.

It turns out I should have done that to begin with. In the Unicode
data, it apparently happens that a character can be both combining
and wide, and that will cause ranges to overlap in my scheme:

I was looking for overlaps in my review, but I skipped right over that,
sorry...

On Thu, 2021-08-26 at 11:12 -0400, John Naylor wrote:

I pushed Jacob's patch with the addendum I shared upthread, plus a
comment about search order. I also took the liberty of changing the
author in the CF app to Jacob. Later I'll push detecting non-spacing
characters beyond the BMP.

Thanks!

Great, thanks

Pavel

Show quoted text

--Jacob