Unicode grapheme clusters

Started by Bruce Momjian, almost 3 years ago, 18 messages

bruce@momjian.us

almost 3 years ago

Just my luck, I had to dig into a two-"character" emoji that came to me
as part of a Google Calendar entry --- here it is:

👩🏼‍⚕️🩺

Unicode   UTF8          libc len  Description
U+1F469   f0 9f 91 a9   2         woman
U+1F3FC   f0 9f 8f bc   2         emoji modifier fitzpatrick type-3 (skin tone)
U+200D    e2 80 8d      0         zero width joiner (ZWJ)
U+2695    e2 9a 95      1         staff with snake
U+FE0F    ef b8 8f      0         variation selector-16 (VS16) (previous character as emoji)
U+1FA7A   f0 9f a9 ba   2         stethoscope

Now, in Debian 11 character apps like vi, I see:

a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)

Display widths are in parentheses. I also see '<200d>' in blue.

In current Firefox, I see a woman with a stethoscope around her neck,
and then a stethoscope. Copying the Unicode string above into a browser
URL bar should show you the same thing, though it might be too small to
see.

For those looking for details on how these should be handled, see this
for an explanation of grapheme clusters that use things like skin tone
modifiers and zero-width joiners:

https://tonsky.me/blog/emoji/

These comments explain the confusion of the term character:

https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme

and I think this comment summarizes it well:

https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237

This is by design. wcwidth() is utterly broken. Any terminal or terminal
application that uses it is also utterly broken. Forget about emoji
wcwidth() doesn't even work with combining characters, zero width
joiners, flags, and a whole bunch of other things.

I decided to see how Postgres, without ICU, handles it:

show lc_ctype;
lc_ctype
-------------
en_US.UTF-8

select octet_length('👩🏼‍⚕️🩺');
octet_length
--------------
21

select character_length('👩🏼‍⚕️🩺');
character_length
------------------
6

The octet_length() is verified as correct by counting the UTF8 bytes
above. I think character_length() is correct if we consider the number
of Unicode characters, display and non-display.
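
The two counts above can be cross-checked outside Postgres. The following is a minimal Python sketch, not Postgres code; it assumes the string is exactly the six code points listed in the table above, and the variable names are made up for illustration:

```python
# The six code points from the table, written as escapes so the source
# does not depend on how an editor renders the emoji.
s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

# Rough analogue of octet_length(): bytes in the UTF-8 encoding.
octets = len(s.encode("utf-8"))

# Rough analogue of character_length() in a UTF8 database: code points.
code_points = len(s)

print(octets, code_points)
```

Neither number says anything about how many columns the string occupies on screen.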

I then started looking at how Postgres computes and uses _display_
width. The display width, when rendered properly as Firefox does, is 4
(two double-wide displayed characters). Summing the libc display
lengths above, which matches the incorrect rendering in Debian 11, gives
7.

libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
the per-encoding width function stored in pg_wchar_table.dsplen --- for
UTF8, the function is pg_utf_dsplen().

There is no SQL API for display length, but PQdsplen() can be exercised
on a whole string by calling pg_wcswidth() from the gdb debugger:

pg_wcswidth(const char *pwcs, size_t len, int encoding)
UTF8 encoding == 6

(gdb) print (int)pg_wcswidth("abcd", 4, 6)
$8 = 4
(gdb) print (int)pg_wcswidth("👩🏼‍⚕️🩺", 21, 6)
$9 = 7

Here is the psql output:

SELECT octet_length('👩🏼‍⚕️🩺'), '👩🏼‍⚕️🩺', character_length('👩🏼‍⚕️🩺');
octet_length | ?column? | character_length
--------------+----------+------------------
21 | 👩🏼‍⚕️🩺 | 6

More often called from psql are pg_wcssize() and pg_wcsformat(), which
also call PQdsplen().

I think the question is whether we want to report a string width that
assumes the display doesn't understand the more complex UTF8
controls/"characters" listed above.

tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
pg_wchar_table.dsplen. p_isspecial() also has a small table of what it
calls "strange_letter".

Here is a report about Unicode variation selector and combining
characters from May, 2022:

/messages/by-id/013f01d873bb$ff5f64b0$fe1e2e10$@ndensan.co.jp

Is this something people want improved?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.

Pavel Stehule

pavel.stehule@gmail.com

almost 3 years ago

In reply to: Bruce Momjian (#1)

Re: Unicode grapheme clusters

On Thu, 19 Jan 2023 at 01:20, Bruce Momjian <bruce@momjian.us> wrote:

Is this something people want improved?

Surely it should be fixed. Unfortunately - all the terminals that I can use
don't support it. So at this moment it may be premature to fix it, because
the visual form will still be broken.

Regards

Pavel

Bruce Momjian

bruce@momjian.us

almost 3 years ago

In reply to: Pavel Stehule (#2)

Re: Unicode grapheme clusters

On Thu, Jan 19, 2023 at 02:44:57PM +0100, Pavel Stehule wrote:

Surely it should be fixed. Unfortunately - all the terminals that I can use
don't support it. So at this moment it may be premature to fix it, because the
visual form will still be broken.

Yes, none of my terminal emulators handle grapheme clusters either. In
fact, viewing this email messed up my screen and I had to use control-L
to fix it.

I think one big problem is that our Unicode library doesn't have any way
I know of to query the display device to determine how it
supports/renders Unicode characters, so any display width we report
could be wrong.

Oddly, it seems grapheme clusters were added in Unicode 3.2, which came
out in 2002:

https://www.unicode.org/reports/tr28/tr28-3.html
https://www.quora.com/What-is-graphemeCluster

but somehow I am only studying them now.

Anyway, I added a psql item for this so we don't forget about it:

https://wiki.postgresql.org/wiki/Todo#psql

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.

Greg Stark

stark@mit.edu

almost 3 years ago

In reply to: Bruce Momjian (#1)

Re: Unicode grapheme clusters

This is how we've always documented it. Postgres treats code points as
"characters" not graphemes.

You don't need to go to anything as esoteric as emojis to see this either.
Accented characters like é have canonical forms that are multiple code
points and in some character sets some accented characters can only be
represented that way.
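
A minimal sketch of the é case, in Python rather than SQL and assuming only the standard unicodedata module: the same visible letter is one code point in composed form (NFC) and two in decomposed form (NFD), so a code-point count differs between them.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")    # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")   # U+0065 followed by U+0301

# Code-point counts differ even though both render as the same letter.
print(len(composed), len(decomposed))

# So do the UTF-8 byte counts.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```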

But I don't think there's any reason to consider changing the existing
functions. They have to be consistent with substr and the other string
manipulation functions.

We could add new functions to work with graphemes but it might bring more
pain keeping it up to date....

Bruce Momjian

bruce@momjian.us

almost 3 years ago

In reply to: Greg Stark (#4)

Re: Unicode grapheme clusters

On Thu, Jan 19, 2023 at 07:37:48PM -0500, Greg Stark wrote:

This is how we've always documented it. Postgres treats code points as
"characters" not graphemes.

You don't need to go to anything as esoteric as emojis to see this either.
Accented characters like é have canonical forms that are multiple code
points and in some character sets some accented characters can only be
represented that way.

But I don't think there's any reason to consider changing the existing functions.
They have to be consistent with substr and the other string manipulation
functions.

We could add new functions to work with graphemes but it might bring more pain
keeping it up to date....

I am not sure what you are referring to above? character_length? I was
talking about display length, and psql uses that --- at some point, our
lack of support for graphemes will cause psql to not align columns.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Bruce Momjian (#5)

Re: Unicode grapheme clusters

Bruce Momjian <bruce@momjian.us> writes:

I am not sure what you are referring to above? character_length? I was
talking about display length, and psql uses that --- at some point, our
lack of support for graphemes will cause psql to not align columns.

That's going to happen regardless, as long as we can't be sure
what the display will do with the characters --- and that's a
problem that will persist for a very long time.

Ideally, yeah, it'd be great if all this stuff rendered perfectly;
but IMO it's so far outside mainstream usage of psql that it's
not something that could possibly repay the investment of time
to get even a partial solution.

regards, tom lane

Bruce Momjian

bruce@momjian.us

almost 3 years ago

In reply to: Tom Lane (#6)

Re: Unicode grapheme clusters

On Thu, Jan 19, 2023 at 07:53:43PM -0500, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

I am not sure what you are referring to above? character_length? I was
talking about display length, and psql uses that --- at some point, our
lack of support for graphemes will cause psql to not align columns.

That's going to happen regardless, as long as we can't be sure
what the display will do with the characters --- and that's a
problem that will persist for a very long time.

Ideally, yeah, it'd be great if all this stuff rendered perfectly;
but IMO it's so far outside mainstream usage of psql that it's
not something that could possibly repay the investment of time
to get even a partial solution.

We have a few options:

* TODO item
* document psql works that way
* do nothing

I think the big question is how common such cases will be in the future.
The report from 2022, and one from 2019 didn't seem to clearly outline
the issue, so it would be good to have something documented somewhere.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.

Pavel Stehule

pavel.stehule@gmail.com

almost 3 years ago

In reply to: Bruce Momjian (#7)

Re: Unicode grapheme clusters

On Fri, 20 Jan 2023 at 02:55, Bruce Momjian <bruce@momjian.us> wrote:

On Thu, Jan 19, 2023 at 07:53:43PM -0500, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

I am not sure what you are referring to above? character_length? I was
talking about display length, and psql uses that --- at some point, our
lack of support for graphemes will cause psql to not align columns.

That's going to happen regardless, as long as we can't be sure
what the display will do with the characters --- and that's a
problem that will persist for a very long time.

Ideally, yeah, it'd be great if all this stuff rendered perfectly;
but IMO it's so far outside mainstream usage of psql that it's
not something that could possibly repay the investment of time
to get even a partial solution.

We have a few options:

* TODO item
* document psql works that way
* do nothing

I think the big question is how common such cases will be in the future.
The report from 2022, and one from 2019 didn't seem to clearly outline
the issue, so it would be good to have something documented somewhere.

There can be a note in the psql documentation like "Unicode grapheme clusters
are not supported yet. They are not well supported by other necessary software
like terminal emulators and curses libraries".

I partially follow the progress in VTE, one of the widely used terminal
libraries, and I am very sceptical that there will be support in the next two
years.

Maybe the new Microsoft terminal will give this area a new dynamic, but
currently only a few people on the planet are working on fixing or enhancing
terminal technologies. Unfortunately there is too much historical ballast.

Regards

Pavel

Greg Stark

stark@mit.edu

almost 3 years ago

In reply to: Pavel Stehule (#8)

Re: Unicode grapheme clusters

On Fri, 20 Jan 2023 at 00:07, Pavel Stehule <pavel.stehule@gmail.com> wrote:

I partially follow the progress in VTE, one of the widely used terminal libraries, and I am very sceptical that there will be support in the next two years.

Maybe the new Microsoft terminal will give this area a new dynamic, but currently only a few people on the planet are working on fixing or enhancing terminal technologies. Unfortunately there is too much historical ballast.

Fwiw this isn't really about terminal emulators. psql is also used to
generate text files for reports or for display in various ways.

I think it's worth using whatever APIs we have available to implement
better alignment for grapheme clusters and just assume whatever will
eventually be used to display the output will display it "properly".

I do not think it's worth trying to implement this ourselves if the
libraries aren't there yet. And I don't think it's worth trying to
adapt to the current state of the current terminal. We don't know that
that's the only place the output will be viewed and it'll all be
wasted effort when the terminals eventually implement full support.

(If we were really crazy about this we could use terminal escape codes
to query the current cursor position after emitting multicharacter
graphemes. But as I said, I don't even think that would be useful,
even if there weren't other reasons it would be a bad idea)

--
greg

#10

Pavel Stehule

pavel.stehule@gmail.com

almost 3 years ago

In reply to: Greg Stark (#9)

Re: Unicode grapheme clusters

On Sat, 21 Jan 2023 at 17:21, Greg Stark <stark@mit.edu> wrote:

On Fri, 20 Jan 2023 at 00:07, Pavel Stehule <pavel.stehule@gmail.com> wrote:

I partially follow the progress in VTE, one of the widely used terminal
libraries, and I am very sceptical that there will be support in the next two
years.

Maybe the new Microsoft terminal will give this area a new dynamic, but
currently only a few people on the planet are working on fixing or enhancing
terminal technologies. Unfortunately there is too much historical ballast.

Fwiw this isn't really about terminal emulators. psql is also used to
generate text files for reports or for display in various ways.

I think it's worth using whatever APIs we have available to implement
better alignment for grapheme clusters and just assume whatever will
eventually be used to display the output will display it "properly".

I do not think it's worth trying to implement this ourselves if the
libraries aren't there yet. And I don't think it's worth trying to
adapt to the current state of the current terminal. We don't know that
that's the only place the output will be viewed and it'll all be
wasted effort when the terminals eventually implement full support.

(If we were really crazy about this we could use terminal escape codes
to query the current cursor position after emitting multicharacter
graphemes. But as I said, I don't even think that would be useful,
even if there weren't other reasons it would be a bad idea)

Pavel

#11

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Greg Stark (#9)

Re: Unicode grapheme clusters

Greg Stark <stark@mit.edu> writes:

(If we were really crazy about this we could use terminal escape codes
to query the current cursor position after emitting multicharacter
graphemes. But as I said, I don't even think that would be useful,
even if there weren't other reasons it would be a bad idea)

Yeah, use of a pager would be enough to break that.

regards, tom lane

#12

Bruce Momjian

bruce@momjian.us

almost 3 years ago

In reply to: Greg Stark (#9)

Re: Unicode grapheme clusters

On Sat, Jan 21, 2023 at 11:20:39AM -0500, Greg Stark wrote:

On Fri, 20 Jan 2023 at 00:07, Pavel Stehule <pavel.stehule@gmail.com> wrote:

I partially follow the progress in VTE, one of the widely used terminal libraries, and I am very sceptical that there will be support in the next two years.

Maybe the new Microsoft terminal will give this area a new dynamic, but currently only a few people on the planet are working on fixing or enhancing terminal technologies. Unfortunately there is too much historical ballast.

Fwiw this isn't really about terminal emulators. psql is also used to
generate text files for reports or for display in various ways.

I think it's worth using whatever APIs we have available to implement
better alignment for grapheme clusters and just assume whatever will
eventually be used to display the output will display it "properly".

I do not think it's worth trying to implement this ourselves if the
libraries aren't there yet. And I don't think it's worth trying to
adapt to the current state of the current terminal. We don't know that
that's the only place the output will be viewed and it'll all be
wasted effort when the terminals eventually implement full support.

Well, as one of the URLs I quoted said:

This is by design. wcwidth() is utterly broken. Any terminal or
terminal application that uses it is also utterly broken. Forget
about emoji wcwidth() doesn't even work with combining characters,
zero width joiners, flags, and a whole bunch of other things.

So, either we have to find a function in the library that will do the
looping over the string for us, or we need to identify the special
Unicode characters that create grapheme clusters and handle them in our
code.
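
A rough sketch of the second option, in Python rather than C and not taken from any Postgres patch. It hard-codes only a few joining rules (ZWJ, VS16, the skin-tone modifiers U+1F3FB to U+1F3FF, and combining marks); it is not the full Unicode text segmentation algorithm and it does not handle flags. All function names are hypothetical:

```python
import unicodedata

ZWJ = "\u200d"
VS16 = "\ufe0f"

def is_extender(ch):
    """True for code points that attach to the preceding one (simplified)."""
    return (
        ch == VS16
        or "\U0001F3FB" <= ch <= "\U0001F3FF"          # emoji skin-tone modifiers
        or unicodedata.category(ch) in ("Mn", "Me")    # combining marks
    )

def clusters(text):
    """Split text into approximate grapheme clusters."""
    out = []
    join_next = False                  # set after a ZWJ: glue the next code point
    for ch in text:
        if out and (join_next or ch == ZWJ or is_extender(ch)):
            out[-1] += ch
        else:
            out.append(ch)
        join_next = (ch == ZWJ)
    return out

def cluster_width(cluster):
    """Width of one cluster, decided by its first code point (an assumption)."""
    base = cluster[0]
    if VS16 in cluster or unicodedata.east_asian_width(base) in ("W", "F"):
        return 2
    if unicodedata.category(base) in ("Mn", "Me", "Cf"):
        return 0
    return 1

def display_width(text):
    return sum(cluster_width(c) for c in clusters(text))

s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"
print(len(clusters(s)), display_width(s))
```

Under these assumptions the string from the start of the thread splits into two clusters with a total width of 4, which is what Firefox shows. Whether a given terminal agrees is a separate question.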

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.

#13

Bruce Momjian

bruce@momjian.us

almost 3 years ago

In reply to: Bruce Momjian (#12)

1 attachment(s)

Re: Unicode grapheme clusters

On Sat, Jan 21, 2023 at 12:37:30PM -0500, Bruce Momjian wrote:

Well, as one of the URLs I quoted said:

This is by design. wcwidth() is utterly broken. Any terminal or
terminal application that uses it is also utterly broken. Forget
about emoji wcwidth() doesn't even work with combining characters,
zero width joiners, flags, and a whole bunch of other things.

So, either we have to find a function in the library that will do the
looping over the string for us, or we need to identify the special
Unicode characters that create grapheme clusters and handle them in our
code.

I just checked if wcswidth() would honor grapheme clusters, though
wcwidth() does not, but it seems wcswidth() treats characters just like
wcwidth():

$ LANG=en_US.UTF-8 grapheme_test
wcswidth len=7

bytes_consumed=4, wcwidth len=2
bytes_consumed=4, wcwidth len=2
bytes_consumed=3, wcwidth len=0
bytes_consumed=3, wcwidth len=1
bytes_consumed=3, wcwidth len=0
bytes_consumed=4, wcwidth len=2

C test program attached. This is on Debian 11.
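
The attached C program is not reproduced here. A rough Python analogue of its per-character loop might look like the following. It assumes a simplified stand-in for wcwidth() built on the standard unicodedata module, so its widths only approximate what a libc reports, and the name approx_wcwidth is made up:

```python
import unicodedata

def approx_wcwidth(ch):
    """Simplified stand-in for wcwidth(); an assumption, not libc's table."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0          # combining marks and format controls such as ZWJ
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2          # wide and fullwidth code points, including most emoji
    return 1

s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

# One (UTF-8 bytes, width) pair per code point, with no cluster awareness.
rows = [(len(ch.encode("utf-8")), approx_wcwidth(ch)) for ch in s]
total = sum(width for _, width in rows)

print(f"summed width={total}")
for consumed, width in rows:
    print(f"bytes_consumed={consumed}, width={width}")
```

Like wcswidth() in the output above, summing per-code-point widths cannot see that the first five code points form one cluster.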

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.

#14

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Bruce Momjian (#13)

Re: Unicode grapheme clusters

Bruce Momjian <bruce@momjian.us> writes:

I just checked if wcswidth() would honor grapheme clusters, though
wcwidth() does not, but it seems wcswidth() treats characters just like
wcwidth():

Well, that's at least potentially fixable within libc, while wcwidth
clearly can never do this right.

Probably our long-term answer is to avoid depending on wcwidth
and use wcswidth instead. But it's hard to get excited about
doing the legwork for that until popular libc implementations
get it right.

regards, tom lane

#15

Bruce Momjian

bruce@momjian.us

almost 3 years ago

In reply to: Tom Lane (#14)

Re: Unicode grapheme clusters

On Sat, Jan 21, 2023 at 01:17:27PM -0500, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

I just checked if wcswidth() would honor grapheme clusters, though
wcwidth() does not, but it seems wcswidth() treats characters just like
wcwidth():

Well, that's at least potentially fixable within libc, while wcwidth
clearly can never do this right.

Probably our long-term answer is to avoid depending on wcwidth
and use wcswidth instead. But it's hard to get excited about
doing the legwork for that until popular libc implementations
get it right.

Agreed.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.

#16

Greg Stark

stark@mit.edu

almost 3 years ago

In reply to: Tom Lane (#14)

Re: Unicode grapheme clusters

On Sat, 21 Jan 2023 at 13:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Probably our long-term answer is to avoid depending on wcwidth
and use wcswidth instead. But it's hard to get excited about
doing the legwork for that until popular libc implementations
get it right.

Here's an interesting blog post about trying to do this in Rust:

https://tomdebruijn.com/posts/rust-string-length-width-calculations/

TL;DR... Even counting the number of graphemes isn't enough because
terminals typically (but not always) display emoji graphemes using two
columns.

At the end of the day Unicode kind of assumes a variable-width display
where the rendering is handled by something that has access to the
actual font metrics. So anything trying to line things up in columns
in a way that works with any rendering system down the line using any
font is going to be making a best guess.

--
greg

#17

Isaac Morland

isaac.morland@gmail.com

almost 3 years ago

In reply to: Greg Stark (#16)

Re: Unicode grapheme clusters

On Tue, 24 Jan 2023 at 11:40, Greg Stark <stark@mit.edu> wrote:

At the end of the day Unicode kind of assumes a variable-width display
where the rendering is handled by something that has access to the
actual font metrics. So anything trying to line things up in columns
in a way that works with any rendering system down the line using any
font is going to be making a best guess.

Really what is needed is another Unicode attribute: how many columns of a
monospaced display each character (or grapheme cluster) should take up. The
standard should include a precisely defined function that can take any
sequence of characters and give back its width in monospaced display
character spaces. Typefaces should only qualify as monospaced if they
respect this standard-defined computation.

Note that this is not actually a new thing: this was included in ASCII
implicitly, with a value of 1 for every character, and a value of n for
every n-character string. It has always been possible to line up values
displayed on monospaced displays by adding spaces, and it is only the
omission of this feature from Unicode which currently makes it impossible.
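
A small sketch of the padding idea, in Python and with hypothetical helper names. With ASCII the code-point count is the column count, so padding by length lines things up. With wide characters the padding has to be driven by some width function instead; the one here is a per-code-point guess, not a standard-defined computation:

```python
import unicodedata

def naive_width(text):
    """Per-code-point width guess; ignores grapheme clusters entirely."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue                      # zero-width: marks and format controls
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

def pad(text, columns, measure=len):
    """Right-pad text with spaces up to `columns`, as judged by `measure`."""
    return text + " " * max(0, columns - measure(text))

ascii_cell = pad("ab", 5)                                     # length equals width
wide_by_len = pad("\u65e5\u672c", 6)                          # pads too much
wide_by_width = pad("\u65e5\u672c", 6, measure=naive_width)   # pads to 6 columns

print(repr(ascii_cell), repr(wide_by_len), repr(wide_by_width))
```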

#18

Bruce Momjian

bruce@momjian.us

almost 3 years ago

In reply to: Greg Stark (#16)

Re: Unicode grapheme clusters

On Tue, Jan 24, 2023 at 11:40:01AM -0500, Greg Stark wrote:

On Sat, 21 Jan 2023 at 13:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Probably our long-term answer is to avoid depending on wcwidth
and use wcswidth instead. But it's hard to get excited about
doing the legwork for that until popular libc implementations
get it right.

Here's an interesting blog post about trying to do this in Rust:

https://tomdebruijn.com/posts/rust-string-length-width-calculations/

TL;DR... Even counting the number of graphemes isn't enough because
terminals typically (but not always) display emoji graphemes using two
columns.

At the end of the day Unicode kind of assumes a variable-width display
where the rendering is handled by something that has access to the
actual font metrics. So anything trying to line things up in columns
in a way that works with any rendering system down the line using any
font is going to be making a best guess.

Yes, good article, though I am still surprised this is not discussed
more often. Anyway, for psql, we assume a fixed width output device, so
we can just assume that for computation. You are right that Unicode
just doesn't seem to consider fixed width output cases and doesn't
provide much guidance.

Beyond psql, should we update our docs to say that character_length()
for Unicode returns the number of Unicode code points, and not
necessarily the number of displayed characters if grapheme clusters are
present?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Embrace your flaws. They make you human, rather than perfect,
which you will never be.
