psql display of Unicode combining characters in 8.2

Started by Michael Fuhr about 19 years ago, 9 messages
#1Michael Fuhr
mike@fuhr.org

psql's display of Unicode combining characters appears to have
changed in 8.2. For example, I'd expect <U+006E LATIN SMALL LETTER N,
U+0303 COMBINING TILDE> to display the same as the precomposed
<U+00F1 LATIN SMALL LETTER N WITH TILDE>. With 8.1's psql they do,
but with 8.2's psql this sequence displays as:

SELECT E'n\314\203'; -- \314\203 = UTF-8 encoding of U+0303
?column?
----------
n\u0303
(1 row)

(I'm testing with both server and client using UTF-8.)
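As a cross-check on the octal escapes used in the query above, here is a minimal sketch in C that encodes a code point as UTF-8 and prints the bytes in backslash-octal form. The helper names encode_utf8() and octal_escape() are made up for this illustration and are not part of psql or libpq.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for surrogates and values
 * above U+10FFFF.  Hypothetical helper, for illustration only.
 */
static size_t
encode_utf8(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not valid scalar values */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/*
 * Write the bytes as backslash-octal escapes (e.g. \314\203) into buf.
 * Stops early if the next escape would not fit.
 */
static void
octal_escape(const unsigned char *bytes, size_t n, char *buf, size_t buflen)
{
    size_t used = 0;

    if (buflen == 0)
        return;
    buf[0] = '\0';
    for (size_t i = 0; i < n && used + 5 <= buflen; i++)
        used += (size_t) snprintf(buf + used, buflen - used,
                                  "\\%03o", (unsigned) bytes[i]);
}
```

Calling encode_utf8(0x0303, b) and passing the two resulting bytes to octal_escape() yields \314\203, matching the literal in the SELECT.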

This excerpt from pg_wcsformat() in mbprint.c looks responsible:

    else if (w <= 0)    /* Non-ascii control char */
    {
        if (encoding == PG_UTF8)
            sprintf((char *) ptr, "\\u%04X", utf2ucs(pwcs));

This might be the relevant commit:

http://archives.postgresql.org/pgsql-committers/2006-02/msg00089.php

Should the code distinguish between combining characters and
zero-width control characters so the former display correctly?

--
Michael Fuhr

#2Michael Fuhr
mike@fuhr.org
In reply to: Michael Fuhr (#1)
Re: psql display of Unicode combining characters in 8.2

On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:

psql's display of Unicode combining characters appears to have
changed in 8.2.

I forgot to mention that this change is in aligned output; unaligned
output prints sequences with combining characters as I'd expect:

test=> SELECT E'n\314\203';
?column?
----------
n\u0303
(1 row)

test=> \a
Output format is unaligned.
test=> SELECT E'n\314\203';
?column?
ñ
(1 row)

--
Michael Fuhr

#3Martijn van Oosterhout
kleptog@svana.org
In reply to: Michael Fuhr (#1)
Re: psql display of Unicode combining characters in 8.2

On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:

Should the code distinguish between combining characters and
zero-width control characters so the former display correctly?

Probably, any idea how to tell the difference?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#3)
Re: psql display of Unicode combining characters in 8.2

Martijn van Oosterhout <kleptog@svana.org> writes:

On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:

Should the code distinguish between combining characters and
zero-width control characters so the former display correctly?

Probably, any idea how to tell the difference?

I'm no expert, but isn't there a specific range of Unicode code points
defined for combining characters?

regards, tom lane

#5Michael Fuhr
mike@fuhr.org
In reply to: Tom Lane (#4)
Re: psql display of Unicode combining characters in 8.2

On Sun, Dec 10, 2006 at 12:30:12PM -0500, Tom Lane wrote:

Martijn van Oosterhout <kleptog@svana.org> writes:

On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:

Should the code distinguish between combining characters and
zero-width control characters so the former display correctly?

Probably, any idea how to tell the difference?

I'm no expert, but isn't there a specific range of Unicode code points
defined for combining characters?

Yes, several, with others scattered about. Could we use the general
category (Mn = Mark, nonspacing; Me = Mark, enclosing)? ucs_wcwidth()
in src/backend/utils/mb/wchar.c already contains some of that
knowledge, doesn't it? The combining[] list looks incomplete but
otherwise close to what we'd need.
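To make the range-table idea concrete, here is a rough sketch of a sorted-interval lookup by binary search. The table below is NOT the combining[] array from wchar.c and is not the full Mn/Me set: it lists only a few Unicode blocks that are dedicated to combining marks, and the names cp_range, sample_combining, and is_sample_combining() are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    unsigned long first;
    unsigned long last;
} cp_range;

/*
 * Illustrative, deliberately incomplete table: a handful of Unicode blocks
 * reserved for combining marks, sorted by first code point.  A real table
 * would be generated from the general categories Mn and Me.
 */
static const cp_range sample_combining[] = {
    {0x0300, 0x036F},           /* Combining Diacritical Marks */
    {0x1AB0, 0x1AFF},           /* Combining Diacritical Marks Extended */
    {0x1DC0, 0x1DFF},           /* Combining Diacritical Marks Supplement */
    {0x20D0, 0x20FF},           /* Combining Diacritical Marks for Symbols */
    {0xFE20, 0xFE2F}            /* Combining Half Marks */
};

/* Binary search over sorted, non-overlapping ranges; 1 if cp is inside one. */
static int
in_ranges(unsigned long cp, const cp_range *table, size_t n)
{
    size_t lo = 0;
    size_t hi = n;

    assert(table != NULL);
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (cp < table[mid].first)
            hi = mid;
        else if (cp > table[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

static int
is_sample_combining(unsigned long cp)
{
    return in_ranges(cp, sample_combining,
                     sizeof(sample_combining) / sizeof(sample_combining[0]));
}
```

With this table, U+0303 is reported as combining while U+00F1 and plain "n" are not. Marks that live outside these blocks (Hebrew points, for instance) would need additional entries.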

--
Michael Fuhr

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Fuhr (#5)
Re: psql display of Unicode combining characters in 8.2

Michael Fuhr <mike@fuhr.org> writes:

On Sun, Dec 10, 2006 at 12:30:12PM -0500, Tom Lane wrote:

Martijn van Oosterhout <kleptog@svana.org> writes:

On Sat, Dec 09, 2006 at 10:50:05PM -0700, Michael Fuhr wrote:

Should the code distinguish between combining characters and
zero-width control characters so the former display correctly?

Probably, any idea how to tell the difference?

I'm no expert, but isn't there a specific range of Unicode code points
defined for combining characters?

Yes, several, with others scattered about.

What about the other way around: use the \u output convention only for
things we can specifically identify as control chars, and assume that
anything else with zero width is a combining char? Is there anything
other than 0-31 and 128-159 that should really get the \u treatment?
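A minimal sketch of that inverted test, assuming only the two ranges named above get the escape treatment. The function names are hypothetical, and DEL (127) is left out simply because it is not in the ranges mentioned here.

```c
#include <assert.h>
#include <stddef.h>

/* C0 controls (0-31) and C1 controls (128-159) only. */
static int
is_c0_c1_control(unsigned long cp)
{
    return cp <= 0x1F || (cp >= 0x80 && cp <= 0x9F);
}

/*
 * Count the code points in cps[0..n-1] that would be shown as \u escapes
 * under this rule; everything else would be printed as-is.
 */
static size_t
count_escaped(const unsigned long *cps, size_t n)
{
    size_t count = 0;

    assert(cps != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (is_c0_c1_control(cps[i]))
            count++;
    }
    return count;
}
```

Under this rule the sequence <U+006E, U+0303> contains nothing to escape, whereas a BEL (U+0007) or NEL (U+0085) would still be escaped.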

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: psql display of Unicode combining characters in 8.2

I wrote:

What about the other way around: use the \u output convention only for
things we can specifically identify as control chars, and assume that
anything else with zero width is a combining char? Is there anything
other than 0-31 and 128-159 that should really get the \u treatment?

Actually, looking at the comments for ucs_wcwidth() in wchar.c, it seems
that this is already accounted for in the "dsplen" output: characters
for which -1 is returned are control characters, characters for which
0 is returned should be printed as-is and counted as zero width. So the
bug is just that pg_wcsformat conflates the two cases.
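The difference between the two cases can be sketched as follows. This is not the pg_wcsformat() code; toy_dsplen() is a made-up stand-in for a dsplen-style width function that knows only one combining range, and the conflate flag exists just to contrast the "w <= 0" test with a "w < 0" test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Toy width function (an assumption for this sketch, not ucs_wcwidth):
 * -1 for control characters, 0 for U+0300..U+036F, 1 for everything else.
 */
static int
toy_dsplen(unsigned long cp)
{
    if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0))
        return -1;
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;
    return 1;
}

/*
 * Return the number of display columns needed for cps[0..n-1] and store
 * the number of escaped code points in *n_escaped.  With conflate != 0,
 * zero-width characters are escaped along with control characters; with
 * conflate == 0, only width -1 is escaped and width 0 adds no columns.
 */
static size_t
display_columns(const unsigned long *cps, size_t n, int conflate,
                size_t *n_escaped)
{
    size_t cols = 0;

    assert(cps != NULL || n == 0);
    assert(n_escaped != NULL);
    *n_escaped = 0;
    for (size_t i = 0; i < n; i++)
    {
        int w = toy_dsplen(cps[i]);
        int escape = conflate ? (w <= 0) : (w < 0);

        if (escape)
        {
            /* columns taken by the escape text itself, e.g. \u0303 */
            cols += (size_t) snprintf(NULL, 0, "\\u%04lX", cps[i]);
            (*n_escaped)++;
        }
        else
            cols += (size_t) w; /* w is 0 or 1 on this branch */
    }
    return cols;
}
```

For <U+006E, U+0303>, the conflated test gives 7 columns with one escape, while the separated test gives 1 column and no escapes, the same width as the precomposed U+00F1.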

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: psql display of Unicode combining characters in 8.2

I wrote:

Actually, looking at the comments for ucs_wcwidth() in wchar.c, it seems
that this is already accounted for in the "dsplen" output: characters
for which -1 is returned are control characters, characters for which
0 is returned should be printed as-is and counted as zero width. So the
bug is just that pg_wcsformat conflates the two cases.

I've applied the attached patch to fix this, but not being much of a
user of languages that have combining characters, I can't test it very
well. Please check out the behavior and see if you like it.

regards, tom lane

#9Michael Fuhr
mike@fuhr.org
In reply to: Tom Lane (#8)
Re: psql display of Unicode combining characters in 8.2

On Wed, Dec 27, 2006 at 02:49:41PM -0500, Tom Lane wrote:

I've applied the attached patch to fix this, but not being much of a
user of languages that have combining characters, I can't test it very
well. Please check out the behavior and see if you like it.

Looks good so far. I've tested languages like Vietnamese (Latin
script with lots of diacritics), polytonic Greek, and pointed Hebrew,
with text normalized to both NFC and NFD. Before the patch the NFD
text had lots of \u escapes; after the patch it looks identical to
the NFC text aside from a few minor differences in the rendered
glyphs, which tells me that I am indeed receiving the decomposed
sequences.

Thanks!

--
Michael Fuhr