badly calculated width of emoji in psql

Started by Pavel Stehuleabout 5 years ago37 messageshackers
Jump to latest
#1Pavel Stehule
pavel.stehule@gmail.com

Hi

I am checked an query from
https://www.depesz.com/2021/04/01/waiting-for-postgresql-14-add-unistr-function/
article.

postgres=# SELECT U&'\+01F603';
┌──────────┐
│ ?column? │
╞══════════╡
│ 😃 │
└──────────┘
(1 row)

The result is not correct. Emoji has width 2 chars, but psql calculates
with just one char.

Regards

Pavel

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Pavel Stehule (#1)
Re: badly calculated width of emoji in psql

On Fri, 2021-04-02 at 10:45 +0200, Pavel Stehule wrote:

I am checked an query from https://www.depesz.com/2021/04/01/waiting-for-postgresql-14-add-unistr-function/ article.

postgres=# SELECT U&'\+01F603';
┌──────────┐
│ ?column? │
╞══════════╡
│ 😃 │
└──────────┘
(1 row)

The result is not correct. Emoji has width 2 chars, but psql calculates with just one char.

How about this:

diff --git a/src/common/wchar.c b/src/common/wchar.c
index 6e7d731e02..e2d0d9691c 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -673,7 +673,8 @@ ucs_wcwidth(pg_wchar ucs)
 		  (ucs >= 0xfe30 && ucs <= 0xfe6f) ||	/* CJK Compatibility Forms */
 		  (ucs >= 0xff00 && ucs <= 0xff5f) ||	/* Fullwidth Forms */
 		  (ucs >= 0xffe0 && ucs <= 0xffe6) ||
-		  (ucs >= 0x20000 && ucs <= 0x2ffff)));
+		  (ucs >= 0x20000 && ucs <= 0x2ffff) ||
+		  (ucs >= 0x1f300 && ucs <= 0x1faff)));	/* symbols and emojis */
 }

/*

This is guesswork based on the unicode entries that look like symbols.

Yours,
Laurenz Albe

#3Pavel Stehule
pavel.stehule@gmail.com
In reply to: Laurenz Albe (#2)
Re: badly calculated width of emoji in psql

pá 2. 4. 2021 v 11:37 odesílatel Laurenz Albe <laurenz.albe@cybertec.at>
napsal:

On Fri, 2021-04-02 at 10:45 +0200, Pavel Stehule wrote:

I am checked an query from

https://www.depesz.com/2021/04/01/waiting-for-postgresql-14-add-unistr-function/
article.

postgres=# SELECT U&'\+01F603';
┌──────────┐
│ ?column? │
╞══════════╡
│ 😃 │
└──────────┘
(1 row)

The result is not correct. Emoji has width 2 chars, but psql calculates

with just one char.

How about this:

diff --git a/src/common/wchar.c b/src/common/wchar.c
index 6e7d731e02..e2d0d9691c 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -673,7 +673,8 @@ ucs_wcwidth(pg_wchar ucs)
(ucs >= 0xfe30 && ucs <= 0xfe6f) ||   /* CJK
Compatibility Forms */
(ucs >= 0xff00 && ucs <= 0xff5f) ||   /* Fullwidth Forms
*/
(ucs >= 0xffe0 && ucs <= 0xffe6) ||
-                 (ucs >= 0x20000 && ucs <= 0x2ffff)));
+                 (ucs >= 0x20000 && ucs <= 0x2ffff) ||
+                 (ucs >= 0x1f300 && ucs <= 0x1faff))); /* symbols and
emojis */
}

/*

This is guesswork based on the unicode entries that look like symbols.

it helps

with this patch, the formatting is correct

Pavel

Show quoted text

Yours,
Laurenz Albe

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Pavel Stehule (#3)
Re: badly calculated width of emoji in psql

At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <pavel.stehule@gmail.com> wrote in

with this patch, the formatting is correct

I think the hardest point of this issue is that we don't have a
reasonable authoritative source that determines character width. And
that the presentation is heavily dependent on environment.

Unicode 9 and/or 10 defines the character properties "Emoji" and
"Emoji_Presentation", and tr51[1]http://www.unicode.org/reports/tr51/ says that

Emoji are generally presented with a square aspect ratio, which
presents a problem for flags.

...

Current practice is for emoji to have a square aspect ratio, deriving
from their origin in Japanese. For interoperability, it is recommended
that this practice be continued with current and future emoji. They
will typically have about the same vertical placement and advance
width as CJK ideographs. For example:

Ok, even putting aside flags, the first table in [2]https://unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt asserts that "#",
"*", "0-9" are emoji characters. But we and I think no-one never
present them in two-columns. And the table has many mysterious holes
I haven't looked into.

We could Emoji_Presentation=yes for the purpose, but for example,
U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
WITH VERTICAL BAR) does not for a reason uncertaion to me. It doesn't
look like other than some kind of mistake.

About environment, for example, U+23E9 is an emoji, and
Emoji_Presentation=yes, but it is shown in one column on my
xterm. (I'm not sure what font am I using..)

[1]: http://www.unicode.org/reports/tr51/
[2]: https://unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt

A possible compromise is that we treat all Emoji=yes characters
excluding ASCII characters as double-width and manually merge the
fragmented regions into reasonably larger chunks.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Pavel Stehule
pavel.stehule@gmail.com
In reply to: Kyotaro Horiguchi (#4)
Re: badly calculated width of emoji in psql

po 5. 4. 2021 v 7:07 odesílatel Kyotaro Horiguchi <horikyota.ntt@gmail.com>
napsal:

At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <pavel.stehule@gmail.com>
wrote in

with this patch, the formatting is correct

I think the hardest point of this issue is that we don't have a
reasonable authoritative source that determines character width. And
that the presentation is heavily dependent on environment.

Unicode 9 and/or 10 defines the character properties "Emoji" and
"Emoji_Presentation", and tr51[1] says that

Emoji are generally presented with a square aspect ratio, which
presents a problem for flags.

...

Current practice is for emoji to have a square aspect ratio, deriving
from their origin in Japanese. For interoperability, it is recommended
that this practice be continued with current and future emoji. They
will typically have about the same vertical placement and advance
width as CJK ideographs. For example:

Ok, even putting aside flags, the first table in [2] asserts that "#",
"*", "0-9" are emoji characters. But we and I think no-one never
present them in two-columns. And the table has many mysterious holes
I haven't looked into.

We could Emoji_Presentation=yes for the purpose, but for example,
U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
WITH VERTICAL BAR) does not for a reason uncertaion to me. It doesn't
look like other than some kind of mistake.

About environment, for example, U+23E9 is an emoji, and
Emoji_Presentation=yes, but it is shown in one column on my
xterm. (I'm not sure what font am I using..)

[1] http://www.unicode.org/reports/tr51/
[2] https://unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt

A possible compromise is that we treat all Emoji=yes characters
excluding ASCII characters as double-width and manually merge the
fragmented regions into reasonably larger chunks.

ok

It should be fixed in glibc,

https://sourceware.org/bugzilla/show_bug.cgi?id=20313

so we can check it

Regards

Pavel

Show quoted text

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Kyotaro Horiguchi (#4)
Re: badly calculated width of emoji in psql

On Mon, 2021-04-05 at 14:07 +0900, Kyotaro Horiguchi wrote:

At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <pavel.stehule@gmail.com> wrote in

with this patch, the formatting is correct

I think the hardest point of this issue is that we don't have a
reasonable authoritative source that determines character width. And
that the presentation is heavily dependent on environment.

Unicode 9 and/or 10 defines the character properties "Emoji" and
"Emoji_Presentation", and tr51[1] says that

Emoji are generally presented with a square aspect ratio, which
presents a problem for flags.

...

Current practice is for emoji to have a square aspect ratio, deriving
from their origin in Japanese. For interoperability, it is recommended
that this practice be continued with current and future emoji. They
will typically have about the same vertical placement and advance
width as CJK ideographs. For example:

Ok, even putting aside flags, the first table in [2] asserts that "#",
"*", "0-9" are emoji characters. But we and I think no-one never
present them in two-columns. And the table has many mysterious holes
I haven't looked into.

I think that's why Emoji_Presentation is false for those characters --
they _could_ be presented as emoji if the UI should choose to do so, or
if an emoji presentation selector is used, but by default a text
presentation would be expected.

We could Emoji_Presentation=yes for the purpose, but for example,
U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
WITH VERTICAL BAR) does not for a reason uncertaion to me. It doesn't
look like other than some kind of mistake.

That is strange.

About environment, for example, U+23E9 is an emoji, and
Emoji_Presentation=yes, but it is shown in one column on my
xterm. (I'm not sure what font am I using..)

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

Furthermore, if the stream contains an emoji presentation selector
after a code point that would normally be text, shouldn't we change
that glyph to have an emoji "expected width"?

I'm wondering if the most correct solution would be to have the user
tell the client what width to use, using .psqlrc or something.

A possible compromise is that we treat all Emoji=yes characters
excluding ASCII characters as double-width and manually merge the
fragmented regions into reasonably larger chunks.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there's
roughly double the number of emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.

--

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

--Jacob

#7Pavel Stehule
pavel.stehule@gmail.com
In reply to: Jacob Champion (#6)
Re: badly calculated width of emoji in psql

st 7. 7. 2021 v 20:03 odesílatel Jacob Champion <pchampion@vmware.com>
napsal:

On Mon, 2021-04-05 at 14:07 +0900, Kyotaro Horiguchi wrote:

At Fri, 2 Apr 2021 11:51:26 +0200, Pavel Stehule <

pavel.stehule@gmail.com> wrote in

with this patch, the formatting is correct

I think the hardest point of this issue is that we don't have a
reasonable authoritative source that determines character width. And
that the presentation is heavily dependent on environment.

Unicode 9 and/or 10 defines the character properties "Emoji" and
"Emoji_Presentation", and tr51[1] says that

Emoji are generally presented with a square aspect ratio, which
presents a problem for flags.

...

Current practice is for emoji to have a square aspect ratio, deriving
from their origin in Japanese. For interoperability, it is recommended
that this practice be continued with current and future emoji. They
will typically have about the same vertical placement and advance
width as CJK ideographs. For example:

Ok, even putting aside flags, the first table in [2] asserts that "#",
"*", "0-9" are emoji characters. But we and I think no-one never
present them in two-columns. And the table has many mysterious holes
I haven't looked into.

I think that's why Emoji_Presentation is false for those characters --
they _could_ be presented as emoji if the UI should choose to do so, or
if an emoji presentation selector is used, but by default a text
presentation would be expected.

We could Emoji_Presentation=yes for the purpose, but for example,
U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE) has the property
Emoji_Presentation=yes but U+23E9(BLACK RIGHT-POINTING DOUBLE TRIANGLE
WITH VERTICAL BAR) does not for a reason uncertaion to me. It doesn't
look like other than some kind of mistake.

That is strange.

About environment, for example, U+23E9 is an emoji, and
Emoji_Presentation=yes, but it is shown in one column on my
xterm. (I'm not sure what font am I using..)

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

Furthermore, if the stream contains an emoji presentation selector
after a code point that would normally be text, shouldn't we change
that glyph to have an emoji "expected width"?

I'm wondering if the most correct solution would be to have the user
tell the client what width to use, using .psqlrc or something.

Gnome terminal does it - VTE does it - there is option how to display chars
with not well specified width.

A possible compromise is that we treat all Emoji=yes characters
excluding ASCII characters as double-width and manually merge the
fragmented regions into reasonably larger chunks.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there's
roughly double the number of emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.

--

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

We can check how these chars are printed in most common terminals in modern
versions. I am afraid that it can be problematic to find a solution that
works everywhere, because some libraries on some platforms are pretty
obsolete.

Regards

Pavel

Show quoted text

--Jacob

#8Michael Paquier
michael@paquier.xyz
In reply to: Jacob Champion (#6)
Re: badly calculated width of emoji in psql

On Wed, Jul 07, 2021 at 06:03:34PM +0000, Jacob Champion wrote:

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

I'd say that we are doing our best in guessing what it should be,
then. One cannot predict how fonts are designed.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there's
roughly double the number of emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.

Hmm. Such things have a cost, and this one sounds costly with a
limited impact. What do we gain except a better visibility with psql?

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

Based on the following list, it seems to me that [u+1f300,u+0x1faff]
won't capture everything, like the country flags:
http://www.unicode.org/emoji/charts/full-emoji-list.html
--
Michael

#9Pavel Stehule
pavel.stehule@gmail.com
In reply to: Michael Paquier (#8)
Re: badly calculated width of emoji in psql

po 19. 7. 2021 v 9:46 odesílatel Michael Paquier <michael@paquier.xyz>
napsal:

On Wed, Jul 07, 2021 at 06:03:34PM +0000, Jacob Champion wrote:

I would guess that's the key issue here. If we choose a particular
width for emoji characters, is there anything keeping a terminal's font
from doing something different anyway?

I'd say that we are doing our best in guessing what it should be,
then. One cannot predict how fonts are designed.

We could also keep the fragments as-is and generate a full interval
table, like common/unicode_combining_table.h. It looks like there's
roughly double the number of emoji intervals as combining intervals, so
hopefully adding a second binary search wouldn't be noticeably slower.

Hmm. Such things have a cost, and this one sounds costly with a
limited impact. What do we gain except a better visibility with psql?

The benefit is correct displaying. I checked impact on server side, and
ucs_wcwidth is used just for calculation of error position. Any other usage
is only in psql.

Moreover, I checked unicode ranges, and I think so for common languages the
performance impact should be zero (because typically use ucs < 0x1100). The
possible (but very low) impact can be for some historic languages or
special symbols. It has not any impact for ranges that currently return
display width 2, because the new range is at the end of list.

I am not sure how wide usage of PQdsplen is outside psql, but I have no
reason to think so, so developers will prefer this function over built
functionality in any developing environment that supports unicode. So in
this case I have a strong opinion to prefer correctness of result against
current speed (note: I have an experience from pspg development, where this
operation is really on critical path, and I tried do some micro
optimization without strong effect - on very big unusual result (very wide,
very long (100K rows) the difference was about 500 ms (on pager side, it
does nothing else than string operations in this moment)).

Regards

Pavel

Show quoted text

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

Based on the following list, it seems to me that [u+1f300,u+0x1faff]
won't capture everything, like the country flags:
http://www.unicode.org/emoji/charts/full-emoji-list.html
--
Michael

#10Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Michael Paquier (#8)
Re: badly calculated width of emoji in psql

On Mon, 2021-07-19 at 16:46 +0900, Michael Paquier wrote:

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

Based on the following list, it seems to me that [u+1f300,u+0x1faff]
won't capture everything, like the country flags:
http://www.unicode.org/emoji/charts/full-emoji-list.html

That could be adapted; the question is if the approach as such is
desirable or not. This is necessarily a moving target, at the rate
that emojis are created and added to Unicode.

My personal feeling is that something simple and perhaps imperfect
as my one-liner that may miss some corner cases would be ok, but
anything that saps more performance or is complicated would not
be worth the effort.

Yours,
Laurenz Albe

#11Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Laurenz Albe (#10)
Re: badly calculated width of emoji in psql

On Mon, 2021-07-19 at 13:13 +0200, Laurenz Albe wrote:

On Mon, 2021-07-19 at 16:46 +0900, Michael Paquier wrote:

In your opinion, would the current one-line patch proposal make things
strictly better than they are today, or would it have mixed results?
I'm wondering how to help this patch move forward for the current
commitfest, or if we should maybe return with feedback for now.

Based on the following list, it seems to me that [u+1f300,u+0x1faff]
won't capture everything, like the country flags:
https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.unicode.org%2Femoji%2Fcharts%2Ffull-emoji-list.html&amp;amp;data=04%7C01%7Cpchampion%40vmware.com%7Cbc3f4cff42094f60fa7708d94aa64f11%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637622900429154586%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;amp;sdata=lfSsqU%2BEiSJrwftt9FL13ib7pw0Mzt5DYl%2BSjL2%2Bm%2F0%3D&amp;amp;reserved=0

On my machine, the regional indicator codes take up one column each
(even though they display as a wide uppercase letter), so making them
wide would break alignment. This seems to match up with Annex #11 [1]http://www.unicode.org/reports/tr11/:

ED4. East Asian Wide (W): All other characters that are always
wide. [...] This category includes characters that have explicit
halfwidth counterparts, along with characters that have the [UTS51]
property Emoji_Presentation, with the exception of characters that
have the [UCD] property Regional_Indicator

So for whatever reason, those indicator codes aren't considered East
Asian Wide by Unicode (and therefore glibc), even though they are
Emoji_Presentation. And glibc appears to be using East Asian Wide as
the flag for a 2-column character.

glibc 2.31 is based on Unicode 12.1, I think. So if Postgres is built
against a Unicode database that's different from the system's,
obviously you'll see odd results no matter what we do here.

And _all_ of that completely ignores the actual country-flag-combining
behavior, which my terminal doesn't do and I assume would be part of a
separate conversation entirely, along with things like ZWJ sequences.

That could be adapted; the question is if the approach as such is
desirable or not. This is necessarily a moving target, at the rate
that emojis are created and added to Unicode.

Sure. We already have code in the tree that deals with that moving
target, though, by parsing apart pieces of the Unicode database. So the
added maintenance cost should be pretty low.

My personal feeling is that something simple and perhaps imperfect
as my one-liner that may miss some corner cases would be ok, but
anything that saps more performance or is complicated would not
be worth the effort.

Another data point: on my machine (Ubuntu 20.04, glibc 2.31) that
additional range not only misses a large number of emoji (e.g. in the
2xxx codepoint range), it incorrectly treats some narrow codepoints as
wide (e.g. many in the 1F32x range have Emoji_Presentation set to
false).

I note that the doc comment for ucs_wcwidth()...

* - Spacing characters in the East Asian Wide (W) or East Asian
* FullWidth (F) category as defined in Unicode Technical
* Report #11 have a column width of 2.

...doesn't match reality anymore. The East Asian width handling was
last updated in 2006, it looks like? So I wonder whether fixing the
code to match the comment would not only fix the emoji problem but also
a bunch of other non-emoji characters.

--Jacob

[1]: http://www.unicode.org/reports/tr11/

#12Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#11)
Re: badly calculated width of emoji in psql

On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:

On Mon, 2021-07-19 at 13:13 +0200, Laurenz Albe wrote:

That could be adapted; the question is if the approach as such is
desirable or not. This is necessarily a moving target, at the rate
that emojis are created and added to Unicode.

Sure. We already have code in the tree that deals with that moving
target, though, by parsing apart pieces of the Unicode database. So the
added maintenance cost should be pretty low.

(I am working on such a patch today and will report back.)

--Jacob

#13Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#11)
Re: badly calculated width of emoji in psql

On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:

I note that the doc comment for ucs_wcwidth()...

* - Spacing characters in the East Asian Wide (W) or East Asian
* FullWidth (F) category as defined in Unicode Technical
* Report #11 have a column width of 2.

...doesn't match reality anymore. The East Asian width handling was
last updated in 2006, it looks like? So I wonder whether fixing the
code to match the comment would not only fix the emoji problem but also
a bunch of other non-emoji characters.

Attached is my attempt at that. This adds a second interval table,
handling not only the emoji range in the original patch but also
correcting several non-emoji character ranges which are included in the
13.0 East Asian Wide/Fullwidth sets. Try for example

- U+2329 LEFT POINTING ANGLE BRACKET
- U+16FE0 TANGUT ITERATION MARK
- U+18000 KATAKANA LETTER ARCHAIC E

This should work reasonably well for terminals that depend on modern
versions of Unicode's EastAsianWidth.txt to figure out character width.
I don't know how it behaves on BSD libc or Windows.

The new binary search isn't free, but my naive attempt at measuring the
performance hit made it look worse than it actually is. Since the
measurement function was previously returning an incorrect (too short)
width, we used to get a free performance boost by not printing the
correct number of alignment/border characters. I'm still trying to
figure out how best to isolate the performance changes due to this
patch.

Pavel, I'd be interested to see what your benchmarks find with this
code. Does this fix the original issue for you?

--Jacob

Attachments:

ucs_wcwidth-update-Fullwidth-and-Wide-codepoint-set.patchtext/x-patch; name=ucs_wcwidth-update-Fullwidth-and-Wide-codepoint-set.patchDownload+208-17
#14Pavel Stehule
pavel.stehule@gmail.com
In reply to: Jacob Champion (#13)
Re: badly calculated width of emoji in psql

Hi

čt 22. 7. 2021 v 0:12 odesílatel Jacob Champion <pchampion@vmware.com>
napsal:

On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:

I note that the doc comment for ucs_wcwidth()...

* - Spacing characters in the East Asian Wide (W) or East Asian
* FullWidth (F) category as defined in Unicode Technical
* Report #11 have a column width of 2.

...doesn't match reality anymore. The East Asian width handling was
last updated in 2006, it looks like? So I wonder whether fixing the
code to match the comment would not only fix the emoji problem but also
a bunch of other non-emoji characters.

Attached is my attempt at that. This adds a second interval table,
handling not only the emoji range in the original patch but also
correcting several non-emoji character ranges which are included in the
13.0 East Asian Wide/Fullwidth sets. Try for example

- U+2329 LEFT POINTING ANGLE BRACKET
- U+16FE0 TANGUT ITERATION MARK
- U+18000 KATAKANA LETTER ARCHAIC E

This should work reasonably well for terminals that depend on modern
versions of Unicode's EastAsianWidth.txt to figure out character width.
I don't know how it behaves on BSD libc or Windows.

The new binary search isn't free, but my naive attempt at measuring the
performance hit made it look worse than it actually is. Since the
measurement function was previously returning an incorrect (too short)
width, we used to get a free performance boost by not printing the
correct number of alignment/border characters. I'm still trying to
figure out how best to isolate the performance changes due to this
patch.

Pavel, I'd be interested to see what your benchmarks find with this
code. Does this fix the original issue for you?

I can confirm that the original issue is fixed.

I tested performance

I had three data sets

1. typical data - mix ascii and utf characters typical for czech language -
25K lines - there is very small slowdown 2ms from 24 to 26ms (stored file
of this result has 3MB)

2. the worst case - this reports has only emoji 1000 chars * 10K rows -
there is more significant slowdown - from 160 ms to 220 ms (stored file has
39MB)

3. a little bit of obscure datasets generated by \x and select * from
pg_proc - it has 99K lines - and there are a lot of unicode decorations
(borders). The line has 17K chars. (the stored file has 1.7GB)
In this dataset I see a slowdown from 4300 to 4700 ms.

In all cases, the data are in memory (in filesystem cache). I tested load
to pspg.

9% looks too high, but in absolute time it is 400ms for 99K lines and very
untypical data, or 2ms for more typical results., 2ms are nothing (for
interactive work). More - this is from a pspg perspective. In psql there
can be overhead of network, protocol processing, formatting, and more and
more, and psql doesn't need to calculate display width of decorations
(borders), what is the reason for slowdowns in pspg.

Pavel

Show quoted text

--Jacob

#15Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Pavel Stehule (#14)
Re: badly calculated width of emoji in psql

On Fri, 2021-07-23 at 17:42 +0200, Pavel Stehule wrote:

čt 22. 7. 2021 v 0:12 odesílatel Jacob Champion <pchampion@vmware.com> napsal:

Pavel, I'd be interested to see what your benchmarks find with this
code. Does this fix the original issue for you?

I can confirm that the original issue is fixed.

Great!

I tested performance

I had three data sets

1. typical data - mix ascii and utf characters typical for czech
language - 25K lines - there is very small slowdown 2ms from 24 to
26ms (stored file of this result has 3MB)

2. the worst case - this reports has only emoji 1000 chars * 10K rows
- there is more significant slowdown - from 160 ms to 220 ms (stored
file has 39MB)

I assume the stored file size has grown with this patch, since we're
now printing the correct number of spacing/border characters?

3. a little bit of obscure datasets generated by \x and select * from
pg_proc - it has 99K lines - and there are a lot of unicode
decorations (borders). The line has 17K chars. (the stored file has
1.7GB)
In this dataset I see a slowdown from 4300 to 4700 ms.

These results are lining up fairly well with my profiling. To isolate
the effects of the new algorithm (as opposed to printing time) I
redirected to /dev/null:

psql postgres -c "select repeat(unistr('\u115D'), 10000000);" > /dev/null

This is what I expect to be the worst case for the new patch: a huge
string consisting of nothing but \u115D, which is in the first interval
and therefore takes the maximum number of iterations during the binary
search.

For that command, the wall time slowdown with this patch was about
240ms (from 1.128s to 1.366s, or 21% slower). Callgrind shows an
increase of 18% in the number of instructions executed with the
interval table patch, all of it coming from PQdsplen (no surprise).
PQdsplen itself has a 36% increase in instruction count for that run.

I also did a microbenchmark of PQdsplen (code attached, requires Google
Benchmark [1]https://github.com/google/benchmark). The three cases I tested were standard ASCII
characters, a smiley-face emoji, and the worst-case \u115F character.

Without the patch:

------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
...
BM_Ascii_mean 4.97 ns 4.97 ns 5
BM_Ascii_median 4.97 ns 4.97 ns 5
BM_Ascii_stddev 0.035 ns 0.035 ns 5
...
BM_Emoji_mean 6.30 ns 6.30 ns 5
BM_Emoji_median 6.30 ns 6.30 ns 5
BM_Emoji_stddev 0.045 ns 0.045 ns 5
...
BM_Worst_mean 12.4 ns 12.4 ns 5
BM_Worst_median 12.4 ns 12.4 ns 5
BM_Worst_stddev 0.038 ns 0.038 ns 5

With the patch:

------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
...
BM_Ascii_mean 4.59 ns 4.59 ns 5
BM_Ascii_median 4.60 ns 4.60 ns 5
BM_Ascii_stddev 0.069 ns 0.068 ns 5
...
BM_Emoji_mean 11.8 ns 11.8 ns 5
BM_Emoji_median 11.8 ns 11.8 ns 5
BM_Emoji_stddev 0.059 ns 0.059 ns 5
...
BM_Worst_mean 18.5 ns 18.5 ns 5
BM_Worst_median 18.5 ns 18.5 ns 5
BM_Worst_stddev 0.077 ns 0.077 ns 5

So an incredibly tiny improvement in the ASCII case, which is
reproducible across multiple runs and not just a fluke (I assume
because the code is smaller now and has better cache line
characteristics?). A ~90% slowdown for the emoji case, and a ~50%
slowdown for the worst-performing characters. That seems perfectly
reasonable considering we're talking about dozens of nanoseconds.

9% looks too high, but in absolute time it is 400ms for 99K lines and
very untypical data, or 2ms for more typical results., 2ms are
nothing (for interactive work). More - this is from a pspg
perspective. In psql there can be overhead of network, protocol
processing, formatting, and more and more, and psql doesn't need to
calculate display width of decorations (borders), what is the reason
for slowdowns in pspg.

Yeah. Considering the alignment code is for user display, the absolute
performance is going to dominate, and I don't see any red flags so far.
If you're regularly dealing with unbelievably huge amounts of emoji, I
think the amount of extra time we're seeing here is unlikely to be a
problem. If it is, you can always turn alignment off. (Do you rely on
horizontal alignment for lines with millions of characters, anyway?)

Laurenz, Michael, what do you think?

--Jacob

[1]: https://github.com/google/benchmark

Attachments:

main.cpptext/x-c++src; name=main.cppDownload
Makefiletext/x-makefile; name=MakefileDownload
#16Pavel Stehule
pavel.stehule@gmail.com
In reply to: Jacob Champion (#13)
Re: badly calculated width of emoji in psql

čt 22. 7. 2021 v 0:12 odesílatel Jacob Champion <pchampion@vmware.com>
napsal:

On Wed, 2021-07-21 at 00:08 +0000, Jacob Champion wrote:

I note that the doc comment for ucs_wcwidth()...

* - Spacing characters in the East Asian Wide (W) or East Asian
* FullWidth (F) category as defined in Unicode Technical
* Report #11 have a column width of 2.

...doesn't match reality anymore. The East Asian width handling was
last updated in 2006, it looks like? So I wonder whether fixing the
code to match the comment would not only fix the emoji problem but also
a bunch of other non-emoji characters.

Attached is my attempt at that. This adds a second interval table,
handling not only the emoji range in the original patch but also
correcting several non-emoji character ranges which are included in the
13.0 East Asian Wide/Fullwidth sets. Try for example

- U+2329 LEFT POINTING ANGLE BRACKET
- U+16FE0 TANGUT ITERATION MARK
- U+18000 KATAKANA LETTER ARCHAIC E

This should work reasonably well for terminals that depend on modern
versions of Unicode's EastAsianWidth.txt to figure out character width.
I don't know how it behaves on BSD libc or Windows.

The new binary search isn't free, but my naive attempt at measuring the
performance hit made it look worse than it actually is. Since the
measurement function was previously returning an incorrect (too short)
width, we used to get a free performance boost by not printing the
correct number of alignment/border characters. I'm still trying to
figure out how best to isolate the performance changes due to this
patch.

Pavel, I'd be interested to see what your benchmarks find with this
code. Does this fix the original issue for you?

This patch fixed badly formatted tables with emoji.

I checked this patch, and it is correct and a step forward, because it
dynamically sets intervals of double wide characters, and the code is more
readable.

I checked and performance, and although there is measurable slowdown, it is
negligible in absolute values. Previous code was a little bit faster - it
checked less ranges, but was not fully correct and up to date.

The patching was without problems
There are no regress tests, but I am not sure so they are necessary for
this case.
make check-world passed without problems

I'll mark this patch as ready for committer

Regards

Pavel

Show quoted text

--Jacob

#17John Naylor
john.naylor@enterprisedb.com
In reply to: Pavel Stehule (#16)
Re: badly calculated width of emoji in psql

I tried this patch on and MacOS11/iterm2 and RHEL 7 (ssh'd from the Mac, in
case that matters) and the example shown at the top of the thread shows no
difference:

john.naylor=# \pset border 2
Border style is 2.
john.naylor=# SELECT U&'\+01F603';
+----------+
| ?column? |
+----------+
| 😃 |
+----------+
(1 row)

(In case it doesn't render locally, the right bar in the result cell is
still shifted to the right.

What is the expected context to show a behavior change? Does one need some
specific terminal or setting?

--
John Naylor
EDB: http://www.enterprisedb.com

#18Pavel Stehule
pavel.stehule@gmail.com
In reply to: John Naylor (#17)
Re: badly calculated width of emoji in psql

čt 12. 8. 2021 v 18:36 odesílatel John Naylor <john.naylor@enterprisedb.com>
napsal:

I tried this patch on and MacOS11/iterm2 and RHEL 7 (ssh'd from the Mac,
in case that matters) and the example shown at the top of the thread shows
no difference:

john.naylor=# \pset border 2
Border style is 2.
john.naylor=# SELECT U&'\+01F603';
+----------+
| ?column? |
+----------+
| 😃 |
+----------+
(1 row)

did you run make clean?

when I executed just patch & make, it didn't work

(In case it doesn't render locally, the right bar in the result cell is
still shifted to the right.

What is the expected context to show a behavior change? Does one need some
specific terminal or setting?

I assigned screenshots

Show quoted text

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

broken.pngimage/png; name=broken.pngDownload
ok.pngimage/png; name=ok.pngDownload
#19Jacob Champion
jacob.champion@enterprisedb.com
In reply to: John Naylor (#17)
Re: badly calculated width of emoji in psql

On Thu, 2021-08-12 at 12:36 -0400, John Naylor wrote:

I tried this patch on and MacOS11/iterm2 and RHEL 7 (ssh'd from the Mac, in case that matters) and the example shown at the top of the thread shows no difference:

john.naylor=# \pset border 2
Border style is 2.
john.naylor=# SELECT U&'\+01F603';
+----------+
| ?column? |
+----------+
| 😃 |
+----------+
(1 row)

(In case it doesn't render locally, the right bar in the result cell is still shifted to the right.

What is the expected context to show a behavior change?

There shouldn't be anything special. (If your terminal was set up to
display emoji in single columns, that would cause alignment issues, but
in the opposite direction to the one you're seeing.)

Does one need some specific terminal or setting?

In your case, an incorrect number of spaces are being printed, so it
shouldn't have anything to do with your terminal settings.

Was this a clean build? Perhaps I've introduced (or exacerbated) a
dependency bug in the Makefile? The patch doing nothing is a surprising
result given the code change.

--Jacob

#20John Naylor
john.naylor@enterprisedb.com
In reply to: Jacob Champion (#19)
Re: badly calculated width of emoji in psql

On Thu, Aug 12, 2021 at 12:46 PM Pavel Stehule <pavel.stehule@gmail.com>
wrote:

did you run make clean?

when I executed just patch & make, it didn't work

I did not, but I always have --enable-depend on. I tried again with make
clean, and ccache -C just in case, and it works now.

On Thu, Aug 12, 2021 at 12:54 PM Jacob Champion <pchampion@vmware.com>
wrote:

Was this a clean build? Perhaps I've introduced (or exacerbated) a
dependency bug in the Makefile? The patch doing nothing is a surprising
result given the code change.

Yeah, given that Pavel had the same issue, that's a possibility. I don't
recall that happening with other unicode patches I've tested.

--
John Naylor
EDB: http://www.enterprisedb.com

#21John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#20)
#22Jacob Champion
jacob.champion@enterprisedb.com
In reply to: John Naylor (#21)
#23Michael Paquier
michael@paquier.xyz
In reply to: John Naylor (#21)
#24John Naylor
john.naylor@enterprisedb.com
In reply to: Michael Paquier (#23)
#25Jacob Champion
jacob.champion@enterprisedb.com
In reply to: John Naylor (#24)
#26John Naylor
john.naylor@enterprisedb.com
In reply to: Jacob Champion (#25)
#27John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#24)
#28Peter Eisentraut
peter_e@gmx.net
In reply to: John Naylor (#26)
#29Jacob Champion
jacob.champion@enterprisedb.com
In reply to: John Naylor (#27)
#30John Naylor
john.naylor@enterprisedb.com
In reply to: Jacob Champion (#29)
#31John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#30)
#32Jacob Champion
jacob.champion@enterprisedb.com
In reply to: John Naylor (#30)
#33John Naylor
john.naylor@enterprisedb.com
In reply to: Jacob Champion (#32)
#34John Naylor
john.naylor@enterprisedb.com
In reply to: Jacob Champion (#32)
#35John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#34)
#36Jacob Champion
jacob.champion@enterprisedb.com
In reply to: John Naylor (#34)
#37Pavel Stehule
pavel.stehule@gmail.com
In reply to: Jacob Champion (#36)