broken table formatting in psql
Hi
I found some data that psql formats badly:
create table foo(a varchar);
insert into foo values('Dětská šperkovnice Goki 15545');
insert into foo values('Tlakoměr Omron Evolv s Bluetooth připojením');
insert into foo values('Řetěz KMC BE08SEP22 stříbrný');
psql versions older than 12 display this table correctly:
(2022-09-01 08:42:44) postgres=# select * from foo;
┌─────────────────────────────────────────────┐
│                      a                      │
╞═════════════════════════════════════════════╡
│ Dětská šperkovnice Goki 15545               │
│ Tlakoměr Omron Evolv s Bluetooth připojením │
│ Řetěz KMC BE08SEP22 stříbrný                │
└─────────────────────────────────────────────┘
(3 rows)
psql 12 and later break the border a little bit:
(2022-09-01 08:42:49) postgres=# select * from foo;
┌─────────────────────────────────────────────┐
│ a │
╞═════════════════════════════════════════════╡
│ Dětská šperkovnice Goki 15545 │
│ Tlakoměr Omron Evolv s Bluetooth připojením │
│ Řetěz KMC BE08SEP22 stříbrný │
└─────────────────────────────────────────────┘
(3 rows)
The problem is the wrong display width assigned to the invisible character U+200E (LEFT-TO-RIGHT MARK):
https://unicodeplus.com/U+200E
(2022-09-01 09:10:05) postgres=# select e'Ahoj\u200eNazdar';
┌─────────────┐
│  ?column?   │
╞═════════════╡
│ AhojNazdar │
└─────────────┘
(1 row)
Regards
Pavel
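The one-column shift in the last example can be reproduced with a minimal C sketch. It is not psql's code: `decode_utf8` and `display_width` are hypothetical helpers, the input is assumed to be well-formed UTF-8, and every code point other than U+200E is assumed to occupy exactly one column (wide and combining characters are ignored for brevity).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence starting at s, store the code point in *cp and
 * return the number of bytes consumed. Assumes well-formed input.
 */
static size_t
decode_utf8(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80)
    {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)
    {
        *cp = ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0)
    {
        *cp = ((uint32_t) (s[0] & 0x0F) << 12) |
              ((uint32_t) (s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t) (s[0] & 0x07) << 18) |
          ((uint32_t) (s[1] & 0x3F) << 12) |
          ((uint32_t) (s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/*
 * Hypothetical helper: sum per-character widths. Every code point counts as
 * one column except U+200E, which counts as lrm_width columns.
 */
static int
display_width(const char *str, int lrm_width)
{
    const unsigned char *s = (const unsigned char *) str;
    int         width = 0;

    while (*s)
    {
        uint32_t    cp;

        s += decode_utf8(s, &cp);
        width += (cp == 0x200E) ? lrm_width : 1;
    }
    return width;
}
```

With the string from the example, `display_width("Ahoj\xE2\x80\x8ENazdar", 1)` gives 11 and `display_width("Ahoj\xE2\x80\x8ENazdar", 0)` gives 10. A formatter that assumes 11 columns while the terminal draws 10 leaves the right border one column short.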
On Thu, Sep 1, 2022 at 2:13 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
problem is in bad width of invisible char 200E
I removed this comment in bab982161e since it didn't match the code.
I'd be interested to see what happened after v12.
- *    - Other format characters (general category code Cf in the Unicode
- *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
UnicodeData.txt has this:
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;
200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;
200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;;
So maybe we need to take Cf characters in this file into account, in
addition to Me and Mn (combining characters).
--
John Naylor
EDB: http://www.enterprisedb.com
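To make the UnicodeData.txt layout concrete: each line is a semicolon-separated record, with the code point in the first field and the general category in the third. The following C sketch classifies one such line. `is_zero_width_line` is a hypothetical name, and treating Mn, Me and Cf alike is the proposal under discussion, not established behavior.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Parse one UnicodeData.txt line. Returns 1 if the general category (third
 * field) is Mn, Me or Cf, and 0 otherwise or on a malformed line. When the
 * line is well-formed, *codepoint receives the value of the first field.
 */
static int
is_zero_width_line(const char *line, unsigned long *codepoint)
{
    const char *sep;
    const char *cat;
    const char *cat_end;

    sep = strchr(line, ';');        /* end of field 0 (code point) */
    if (sep == NULL)
        return 0;
    sep = strchr(sep + 1, ';');     /* end of field 1 (name) */
    if (sep == NULL)
        return 0;
    cat = sep + 1;                  /* start of field 2 (category) */
    cat_end = strchr(cat, ';');
    if (cat_end == NULL || cat_end - cat != 2)
        return 0;

    *codepoint = strtoul(line, NULL, 16);

    return strncmp(cat, "Mn", 2) == 0 ||
           strncmp(cat, "Me", 2) == 0 ||
           strncmp(cat, "Cf", 2) == 0;
}
```

For the line `200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;` this returns 1 and sets the code point to 0x200E.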
At Thu, 1 Sep 2022 15:00:38 +0700, John Naylor <john.naylor@enterprisedb.com> wrote in
On Thu, Sep 1, 2022 at 2:13 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
problem is in bad width of invisible char 200E
I removed this comment in bab982161e since it didn't match the code.
I'd be interested to see what happened after v12.
- *    - Other format characters (general category code Cf in the Unicode
- *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
UnicodeData.txt has this:
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;
200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;
200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;;
So maybe we need to take Cf characters in this file into account, in
addition to Me and Mn (combining characters).
Including them into unicode_combining_table.h actually worked, but I'm
not sure it is valid to include Cf's among Mn/Me's..
diff --git a/src/common/unicode/generate-unicode_combining_table.pl b/src/common/unicode/generate-unicode_combining_table.pl
index 8177c20260..7030bc637b 100644
--- a/src/common/unicode/generate-unicode_combining_table.pl
+++ b/src/common/unicode/generate-unicode_combining_table.pl
@@ -25,7 +25,7 @@ foreach my $line (<ARGV>)
     my @fields = split ';', $line;
     $codepoint = hex $fields[0];
-    if ($fields[2] eq 'Me' || $fields[2] eq 'Mn')
+    if ($fields[2] eq 'Me' || $fields[2] eq 'Mn' || $fields[2] eq 'Cf')
     {
         # combining character, save for start of range
         if (!defined($range_start))
By the way, I was quite annoyed at how hard it was to get changes
under src/common reflected in the final binary. There are two hops of
missing dependencies, and finally ccache stood in my way. I see that
Andres once meant to try that using --dependency-files, but I hope we
make that reflection automatic even if we have to define the
dependencies manually.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 01 Sep 2022 18:22:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Thu, 1 Sep 2022 15:00:38 +0700, John Naylor <john.naylor@enterprisedb.com> wrote in
UnicodeData.txt has this:
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;
200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;
200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;;
So maybe we need to take Cf characters in this file into account, in
addition to Me and Mn (combining characters).
Including them into unicode_combining_table.h actually worked, but I'm
not sure it is valid to include Cf's among Mn/Me's..
UnicodeData.txt
174:00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
Soft hyphen does not seem to be zero-width... usually.
0600;ARABIC NUMBER SIGN;Cf;0;AN;;;;;N;;;;;
110BD;KAITHI NUMBER SIGN;Cf;0;L;;;;;N;;;;;
Mmm. These do not look zero-width?
However, it seems like basically a win if we add the "Cf"s to the
"combining" table?
====
174:00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
1499:0600;ARABIC NUMBER SIGN;Cf;0;AN;;;;;N;;;;;
1500:0601;ARABIC SIGN SANAH;Cf;0;AN;;;;;N;;;;;
1501:0602;ARABIC FOOTNOTE MARKER;Cf;0;AN;;;;;N;;;;;
1502:0603;ARABIC SIGN SAFHA;Cf;0;AN;;;;;N;;;;;
1503:0604;ARABIC SIGN SAMVAT;Cf;0;AN;;;;;N;;;;;
1504:0605;ARABIC NUMBER MARK ABOVE;Cf;0;AN;;;;;N;;;;;
1527:061C;ARABIC LETTER MARK;Cf;0;AL;;;;;N;;;;;
1720:06DD;ARABIC END OF AYAH;Cf;0;AN;;;;;N;;;;;
1769:070F;SYRIAC ABBREVIATION MARK;Cf;0;AL;;;;;N;;;;;
2124:0890;ARABIC POUND MARK ABOVE;Cf;0;AN;;;;;N;;;;;
2125:0891;ARABIC PIASTRE MARK ABOVE;Cf;0;AN;;;;;N;;;;;
2200:08E2;ARABIC DISPUTED END OF AYAH;Cf;0;AN;;;;;N;;;;;
5517:180E;MONGOLIAN VOWEL SEPARATOR;Cf;0;BN;;;;;N;;;;;
7365:200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
7366:200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
7367:200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;
7368:200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;
7369:200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;;
7396:202A;LEFT-TO-RIGHT EMBEDDING;Cf;0;LRE;;;;;N;;;;;
7397:202B;RIGHT-TO-LEFT EMBEDDING;Cf;0;RLE;;;;;N;;;;;
7398:202C;POP DIRECTIONAL FORMATTING;Cf;0;PDF;;;;;N;;;;;
7399:202D;LEFT-TO-RIGHT OVERRIDE;Cf;0;LRO;;;;;N;;;;;
7400:202E;RIGHT-TO-LEFT OVERRIDE;Cf;0;RLO;;;;;N;;;;;
7450:2060;WORD JOINER;Cf;0;BN;;;;;N;;;;;
7451:2061;FUNCTION APPLICATION;Cf;0;BN;;;;;N;;;;;
7452:2062;INVISIBLE TIMES;Cf;0;BN;;;;;N;;;;;
7453:2063;INVISIBLE SEPARATOR;Cf;0;BN;;;;;N;;;;;
7454:2064;INVISIBLE PLUS;Cf;0;BN;;;;;N;;;;;
7455:2066;LEFT-TO-RIGHT ISOLATE;Cf;0;LRI;;;;;N;;;;;
7456:2067;RIGHT-TO-LEFT ISOLATE;Cf;0;RLI;;;;;N;;;;;
7457:2068;FIRST STRONG ISOLATE;Cf;0;FSI;;;;;N;;;;;
7458:2069;POP DIRECTIONAL ISOLATE;Cf;0;PDI;;;;;N;;;;;
7459:206A;INHIBIT SYMMETRIC SWAPPING;Cf;0;BN;;;;;N;;;;;
7460:206B;ACTIVATE SYMMETRIC SWAPPING;Cf;0;BN;;;;;N;;;;;
7461:206C;INHIBIT ARABIC FORM SHAPING;Cf;0;BN;;;;;N;;;;;
7462:206D;ACTIVATE ARABIC FORM SHAPING;Cf;0;BN;;;;;N;;;;;
7463:206E;NATIONAL DIGIT SHAPES;Cf;0;BN;;;;;N;;;;;
7464:206F;NOMINAL DIGIT SHAPES;Cf;0;BN;;;;;N;;;;;
16660:FEFF;ZERO WIDTH NO-BREAK SPACE;Cf;0;BN;;;;;N;BYTE ORDER MARK;;;;
16886:FFF9;INTERLINEAR ANNOTATION ANCHOR;Cf;0;ON;;;;;N;;;;;
16887:FFFA;INTERLINEAR ANNOTATION SEPARATOR;Cf;0;ON;;;;;N;;;;;
16888:FFFB;INTERLINEAR ANNOTATION TERMINATOR;Cf;0;ON;;;;;N;;;;;
19731:110BD;KAITHI NUMBER SIGN;Cf;0;L;;;;;N;;;;;
19737:110CD;KAITHI NUMBER SIGN ABOVE;Cf;0;L;;;;;N;;;;;
24043:13430;EGYPTIAN HIEROGLYPH VERTICAL JOINER;Cf;0;L;;;;;N;;;;;
24044:13431;EGYPTIAN HIEROGLYPH HORIZONTAL JOINER;Cf;0;L;;;;;N;;;;;
24045:13432;EGYPTIAN HIEROGLYPH INSERT AT TOP START;Cf;0;L;;;;;N;;;;;
24046:13433;EGYPTIAN HIEROGLYPH INSERT AT BOTTOM START;Cf;0;L;;;;;N;;;;;
24047:13434;EGYPTIAN HIEROGLYPH INSERT AT TOP END;Cf;0;L;;;;;N;;;;;
24048:13435;EGYPTIAN HIEROGLYPH INSERT AT BOTTOM END;Cf;0;L;;;;;N;;;;;
24049:13436;EGYPTIAN HIEROGLYPH OVERLAY MIDDLE;Cf;0;L;;;;;N;;;;;
24050:13437;EGYPTIAN HIEROGLYPH BEGIN SEGMENT;Cf;0;L;;;;;N;;;;;
24051:13438;EGYPTIAN HIEROGLYPH END SEGMENT;Cf;0;L;;;;;N;;;;;
27838:1BCA0;SHORTHAND FORMAT LETTER OVERLAP;Cf;0;BN;;;;;N;;;;;
27839:1BCA1;SHORTHAND FORMAT CONTINUING OVERLAP;Cf;0;BN;;;;;N;;;;;
27840:1BCA2;SHORTHAND FORMAT DOWN STEP;Cf;0;BN;;;;;N;;;;;
27841:1BCA3;SHORTHAND FORMAT UP STEP;Cf;0;BN;;;;;N;;;;;
28386:1D173;MUSICAL SYMBOL BEGIN BEAM;Cf;0;BN;;;;;N;;;;;
28387:1D174;MUSICAL SYMBOL END BEAM;Cf;0;BN;;;;;N;;;;;
28388:1D175;MUSICAL SYMBOL BEGIN TIE;Cf;0;BN;;;;;N;;;;;
28389:1D176;MUSICAL SYMBOL END TIE;Cf;0;BN;;;;;N;;;;;
28390:1D177;MUSICAL SYMBOL BEGIN SLUR;Cf;0;BN;;;;;N;;;;;
28391:1D178;MUSICAL SYMBOL END SLUR;Cf;0;BN;;;;;N;;;;;
28392:1D179;MUSICAL SYMBOL BEGIN PHRASE;Cf;0;BN;;;;;N;;;;;
28393:1D17A;MUSICAL SYMBOL END PHRASE;Cf;0;BN;;;;;N;;;;;
34286:E0001;LANGUAGE TAG;Cf;0;BN;;;;;N;;;;;
34287:E0020;TAG SPACE;Cf;0;BN;;;;;N;;;;;
34288:E0021;TAG EXCLAMATION MARK;Cf;0;BN;;;;;N;;;;;
34289:E0022;TAG QUOTATION MARK;Cf;0;BN;;;;;N;;;;;
34290:E0023;TAG NUMBER SIGN;Cf;0;BN;;;;;N;;;;;
34291:E0024;TAG DOLLAR SIGN;Cf;0;BN;;;;;N;;;;;
34292:E0025;TAG PERCENT SIGN;Cf;0;BN;;;;;N;;;;;
34293:E0026;TAG AMPERSAND;Cf;0;BN;;;;;N;;;;;
34294:E0027;TAG APOSTROPHE;Cf;0;BN;;;;;N;;;;;
34295:E0028;TAG LEFT PARENTHESIS;Cf;0;BN;;;;;N;;;;;
34296:E0029;TAG RIGHT PARENTHESIS;Cf;0;BN;;;;;N;;;;;
34297:E002A;TAG ASTERISK;Cf;0;BN;;;;;N;;;;;
34298:E002B;TAG PLUS SIGN;Cf;0;BN;;;;;N;;;;;
34299:E002C;TAG COMMA;Cf;0;BN;;;;;N;;;;;
34300:E002D;TAG HYPHEN-MINUS;Cf;0;BN;;;;;N;;;;;
34301:E002E;TAG FULL STOP;Cf;0;BN;;;;;N;;;;;
34302:E002F;TAG SOLIDUS;Cf;0;BN;;;;;N;;;;;
34303:E0030;TAG DIGIT ZERO;Cf;0;BN;;;;;N;;;;;
34304:E0031;TAG DIGIT ONE;Cf;0;BN;;;;;N;;;;;
34305:E0032;TAG DIGIT TWO;Cf;0;BN;;;;;N;;;;;
34306:E0033;TAG DIGIT THREE;Cf;0;BN;;;;;N;;;;;
34307:E0034;TAG DIGIT FOUR;Cf;0;BN;;;;;N;;;;;
34308:E0035;TAG DIGIT FIVE;Cf;0;BN;;;;;N;;;;;
34309:E0036;TAG DIGIT SIX;Cf;0;BN;;;;;N;;;;;
34310:E0037;TAG DIGIT SEVEN;Cf;0;BN;;;;;N;;;;;
34311:E0038;TAG DIGIT EIGHT;Cf;0;BN;;;;;N;;;;;
34312:E0039;TAG DIGIT NINE;Cf;0;BN;;;;;N;;;;;
34313:E003A;TAG COLON;Cf;0;BN;;;;;N;;;;;
34314:E003B;TAG SEMICOLON;Cf;0;BN;;;;;N;;;;;
34315:E003C;TAG LESS-THAN SIGN;Cf;0;BN;;;;;N;;;;;
34316:E003D;TAG EQUALS SIGN;Cf;0;BN;;;;;N;;;;;
34317:E003E;TAG GREATER-THAN SIGN;Cf;0;BN;;;;;N;;;;;
34318:E003F;TAG QUESTION MARK;Cf;0;BN;;;;;N;;;;;
34319:E0040;TAG COMMERCIAL AT;Cf;0;BN;;;;;N;;;;;
34320:E0041;TAG LATIN CAPITAL LETTER A;Cf;0;BN;;;;;N;;;;;
34321:E0042;TAG LATIN CAPITAL LETTER B;Cf;0;BN;;;;;N;;;;;
34322:E0043;TAG LATIN CAPITAL LETTER C;Cf;0;BN;;;;;N;;;;;
34323:E0044;TAG LATIN CAPITAL LETTER D;Cf;0;BN;;;;;N;;;;;
34324:E0045;TAG LATIN CAPITAL LETTER E;Cf;0;BN;;;;;N;;;;;
34325:E0046;TAG LATIN CAPITAL LETTER F;Cf;0;BN;;;;;N;;;;;
34326:E0047;TAG LATIN CAPITAL LETTER G;Cf;0;BN;;;;;N;;;;;
34327:E0048;TAG LATIN CAPITAL LETTER H;Cf;0;BN;;;;;N;;;;;
34328:E0049;TAG LATIN CAPITAL LETTER I;Cf;0;BN;;;;;N;;;;;
34329:E004A;TAG LATIN CAPITAL LETTER J;Cf;0;BN;;;;;N;;;;;
34330:E004B;TAG LATIN CAPITAL LETTER K;Cf;0;BN;;;;;N;;;;;
34331:E004C;TAG LATIN CAPITAL LETTER L;Cf;0;BN;;;;;N;;;;;
34332:E004D;TAG LATIN CAPITAL LETTER M;Cf;0;BN;;;;;N;;;;;
34333:E004E;TAG LATIN CAPITAL LETTER N;Cf;0;BN;;;;;N;;;;;
34334:E004F;TAG LATIN CAPITAL LETTER O;Cf;0;BN;;;;;N;;;;;
34335:E0050;TAG LATIN CAPITAL LETTER P;Cf;0;BN;;;;;N;;;;;
34336:E0051;TAG LATIN CAPITAL LETTER Q;Cf;0;BN;;;;;N;;;;;
34337:E0052;TAG LATIN CAPITAL LETTER R;Cf;0;BN;;;;;N;;;;;
34338:E0053;TAG LATIN CAPITAL LETTER S;Cf;0;BN;;;;;N;;;;;
34339:E0054;TAG LATIN CAPITAL LETTER T;Cf;0;BN;;;;;N;;;;;
34340:E0055;TAG LATIN CAPITAL LETTER U;Cf;0;BN;;;;;N;;;;;
34341:E0056;TAG LATIN CAPITAL LETTER V;Cf;0;BN;;;;;N;;;;;
34342:E0057;TAG LATIN CAPITAL LETTER W;Cf;0;BN;;;;;N;;;;;
34343:E0058;TAG LATIN CAPITAL LETTER X;Cf;0;BN;;;;;N;;;;;
34344:E0059;TAG LATIN CAPITAL LETTER Y;Cf;0;BN;;;;;N;;;;;
34345:E005A;TAG LATIN CAPITAL LETTER Z;Cf;0;BN;;;;;N;;;;;
34346:E005B;TAG LEFT SQUARE BRACKET;Cf;0;BN;;;;;N;;;;;
34347:E005C;TAG REVERSE SOLIDUS;Cf;0;BN;;;;;N;;;;;
34348:E005D;TAG RIGHT SQUARE BRACKET;Cf;0;BN;;;;;N;;;;;
34349:E005E;TAG CIRCUMFLEX ACCENT;Cf;0;BN;;;;;N;;;;;
34350:E005F;TAG LOW LINE;Cf;0;BN;;;;;N;;;;;
34351:E0060;TAG GRAVE ACCENT;Cf;0;BN;;;;;N;;;;;
34352:E0061;TAG LATIN SMALL LETTER A;Cf;0;BN;;;;;N;;;;;
34353:E0062;TAG LATIN SMALL LETTER B;Cf;0;BN;;;;;N;;;;;
34354:E0063;TAG LATIN SMALL LETTER C;Cf;0;BN;;;;;N;;;;;
34355:E0064;TAG LATIN SMALL LETTER D;Cf;0;BN;;;;;N;;;;;
34356:E0065;TAG LATIN SMALL LETTER E;Cf;0;BN;;;;;N;;;;;
34357:E0066;TAG LATIN SMALL LETTER F;Cf;0;BN;;;;;N;;;;;
34358:E0067;TAG LATIN SMALL LETTER G;Cf;0;BN;;;;;N;;;;;
34359:E0068;TAG LATIN SMALL LETTER H;Cf;0;BN;;;;;N;;;;;
34360:E0069;TAG LATIN SMALL LETTER I;Cf;0;BN;;;;;N;;;;;
34361:E006A;TAG LATIN SMALL LETTER J;Cf;0;BN;;;;;N;;;;;
34362:E006B;TAG LATIN SMALL LETTER K;Cf;0;BN;;;;;N;;;;;
34363:E006C;TAG LATIN SMALL LETTER L;Cf;0;BN;;;;;N;;;;;
34364:E006D;TAG LATIN SMALL LETTER M;Cf;0;BN;;;;;N;;;;;
34365:E006E;TAG LATIN SMALL LETTER N;Cf;0;BN;;;;;N;;;;;
34366:E006F;TAG LATIN SMALL LETTER O;Cf;0;BN;;;;;N;;;;;
34367:E0070;TAG LATIN SMALL LETTER P;Cf;0;BN;;;;;N;;;;;
34368:E0071;TAG LATIN SMALL LETTER Q;Cf;0;BN;;;;;N;;;;;
34369:E0072;TAG LATIN SMALL LETTER R;Cf;0;BN;;;;;N;;;;;
34370:E0073;TAG LATIN SMALL LETTER S;Cf;0;BN;;;;;N;;;;;
34371:E0074;TAG LATIN SMALL LETTER T;Cf;0;BN;;;;;N;;;;;
34372:E0075;TAG LATIN SMALL LETTER U;Cf;0;BN;;;;;N;;;;;
34373:E0076;TAG LATIN SMALL LETTER V;Cf;0;BN;;;;;N;;;;;
34374:E0077;TAG LATIN SMALL LETTER W;Cf;0;BN;;;;;N;;;;;
34375:E0078;TAG LATIN SMALL LETTER X;Cf;0;BN;;;;;N;;;;;
34376:E0079;TAG LATIN SMALL LETTER Y;Cf;0;BN;;;;;N;;;;;
34377:E007A;TAG LATIN SMALL LETTER Z;Cf;0;BN;;;;;N;;;;;
34378:E007B;TAG LEFT CURLY BRACKET;Cf;0;BN;;;;;N;;;;;
34379:E007C;TAG VERTICAL LINE;Cf;0;BN;;;;;N;;;;;
34380:E007D;TAG RIGHT CURLY BRACKET;Cf;0;BN;;;;;N;;;;;
34381:E007E;TAG TILDE;Cf;0;BN;;;;;N;;;;;
34382:E007F;CANCEL TAG;Cf;0;BN;;;;;N;;;;;
====
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Sep 2, 2022 at 12:17 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Thu, 01 Sep 2022 18:22:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Thu, 1 Sep 2022 15:00:38 +0700, John Naylor <john.naylor@enterprisedb.com> wrote in
UnicodeData.txt has this:
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;
200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;
200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;;
So maybe we need to take Cf characters in this file into account, in
addition to Me and Mn (combining characters).
Including them into unicode_combining_table.h actually worked, but I'm
not sure it is valid to include Cf's among Mn/Me's..
Looking at the definition, Cf means "other, format" category, "Format
character that affects the layout of text or the operation of text
processes, but is not normally rendered". [1]
UnicodeData.txt
174:00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
Soft-hyphen seems like not zero-width.. usually...
I gather it only appears at line breaks, which I doubt we want to handle.
0600;ARABIC NUMBER SIGN;Cf;0;AN;;;;;N;;;;;
110BD;KAITHI NUMBER SIGN;Cf;0;L;;;;;N;;;;;
Mmm. These looks like not zero-width?
There are glyphs, but there is something special about the first one:
select U&'\0600';
Looks like this in psql (substituting 'X' to avoid systemic differences):
+----------+
| ?column? |
+----------+
| X        |
+----------+
(1 row)
Copy from psql to vim or nano:
+----------+
| ?column? |
+----------+
| X |
+----------+
(1 row)
...so it does mess up the border the same way. The second
(U&'\+0110bd') doesn't render for me.
However, it seems like basically a win if we include "Cf"s to the
"combining" table?
There seems to be a case for that. If we did include those, we should
rename the table to match.
I found this old document from 2002 on "default ignorable" characters
that normally have no visible glyph:
https://unicode.org/L2/L2002/02368-default-ignorable.html
If there is any doubt about including all of Cf, we could also just
add a branch in wchar.c to hard-code the 200B-200F range.
--
John Naylor
EDB: http://www.enterprisedb.com
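The fallback mentioned last, hard-coding the 200B-200F range instead of taking all of Cf from the data file, could look roughly like the sketch below. It is only an illustration of the idea: `width_with_hardcoded_range` is a hypothetical name, and `base_width` stands in for whatever the existing table lookup would have returned.

```c
#include <stdint.h>

/*
 * Hypothetical wrapper: force a width of 0 for U+200B..U+200F and otherwise
 * pass through the width computed elsewhere (base_width).
 */
static int
width_with_hardcoded_range(uint32_t ucs, int base_width)
{
    if (ucs >= 0x200B && ucs <= 0x200F)
        return 0;
    return base_width;
}
```

The trade-off is the one raised in the message: a branch like this is easy to reason about, but it covers only these five code points and has to be maintained by hand, whereas a table generated from UnicodeData.txt follows the data file.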
At Fri, 2 Sep 2022 13:43:50 +0700, John Naylor <john.naylor@enterprisedb.com> wrote in
On Fri, Sep 2, 2022 at 12:17 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Including them into unicode_combining_table.h actually worked, but I'm
not sure it is valid to include Cf's among Mn/Me's..
Looking at the definition, Cf means "other, format" category, "Format
character that affects the layout of text or the operation of text
processes, but is not normally rendered". [1]
UnicodeData.txt
174:00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
Soft-hyphen seems like not zero-width.. usually...
I gather it only appears at line breaks, which I doubt we want to handle.
Yeah. Sounds reasonable. (Emacs always renders it, though..)
0600;ARABIC NUMBER SIGN;Cf;0;AN;;;;;N;;;;;
110BD;KAITHI NUMBER SIGN;Cf;0;L;;;;;N;;;;;
Mmm. These looks like not zero-width?
There are glyphs, but there is something special about the first one:
select U&'\0600';
Looks like this in psql (substituting 'X' to avoid systemic differences):
+----------+
| ?column? |
+----------+
| X        |
+----------+
(1 row)
Copy from psql to vim or nano:
+----------+
| ?column? |
+----------+
| X |
+----------+
(1 row)
...so it does mess up the border the same way. The second
(U&'\+0110bd') doesn't render for me.
Anyway it is inevitably rendering-environment dependent.
However, it seems like basically a win if we include "Cf"s to the
"combining" table?
There seems to be a case for that. If we did include those, we should
rename the table to match.
Agreed:)
I found this old document from 2002 on "default ignorable" characters
that normally have no visible glyph:
Mmm. Too old?
If there is any doubt about including all of Cf, we could also just
add a branch in wchar.c to hard-code the 200B-200F range.
If every approach is defective to a similar extent, I think we should
use the authoritative data, at least as a first step. We might want
additional filtering on top of it, but that would be a separate issue,
maybe.
Attached is the first cut of that. (The commit message is not great,
though.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment: v1-0001-Treat-Unicode-characters-of-category-Format-as-no.patch (text/x-patch; charset=us-ascii)
From fefff124798c3b716a64893ce04e9331b301a379 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 2 Sep 2022 17:16:03 +0900
Subject: [PATCH v1] Treat Unicode characters of category "Format" as
non-spacing
In PostgreSQL 12, the table of non-spacing characters that
ucs_wcwidth() uses was regenerated from UnicodeData.txt so that it
consists of the characters of categories Mn and Me. With this change the
table lost some characters that are actually zero-width, which broke
the alignment of psql output when these characters are involved.
Add all characters of category Cf (Format), which are not normally
rendered. However, some corner cases are left alone.
SOFT_HYPHEN (U+00AD) can be rendered under some conditions, but we don't
break a line at it, so we assume it is not rendered in psql output.
ARABIC_NUMBER_SIGN (U+0600), KAITHI_NUMBER_SIGN (U+110BD) and some
similar characters seem to have glyphs, so they might make the
calculation wrong.
---
src/common/unicode/Makefile | 4 +--
...l => generate-unicode_nonspacing_table.pl} | 10 +++---
src/common/wchar.c | 8 ++---
...ing_table.h => unicode_nonspacing_table.h} | 33 ++++++++++++-------
4 files changed, 32 insertions(+), 23 deletions(-)
rename src/common/unicode/{generate-unicode_combining_table.pl => generate-unicode_nonspacing_table.pl} (68%)
rename src/include/common/{unicode_combining_table.h => unicode_nonspacing_table.h} (91%)
diff --git a/src/common/unicode/Makefile b/src/common/unicode/Makefile
index 60e01e748f..382da476cf 100644
--- a/src/common/unicode/Makefile
+++ b/src/common/unicode/Makefile
@@ -18,7 +18,7 @@ LIBS += $(PTHREAD_LIBS)
# By default, do nothing.
all:
-update-unicode: unicode_norm_table.h unicode_combining_table.h unicode_east_asian_fw_table.h unicode_normprops_table.h unicode_norm_hashfunc.h
+update-unicode: unicode_norm_table.h unicode_nonspacing_table.h unicode_east_asian_fw_table.h unicode_normprops_table.h unicode_norm_hashfunc.h
mv $^ $(top_srcdir)/src/include/common/
$(MAKE) normalization-check
@@ -35,7 +35,7 @@ unicode_norm_hashfunc.h: unicode_norm_table.h
unicode_norm_table.h: generate-unicode_norm_table.pl UnicodeData.txt CompositionExclusions.txt
$(PERL) $<
-unicode_combining_table.h: generate-unicode_combining_table.pl UnicodeData.txt
+unicode_nonspacing_table.h: generate-unicode_nonspacing_table.pl UnicodeData.txt
$(PERL) $^ >$@
unicode_east_asian_fw_table.h: generate-unicode_east_asian_fw_table.pl EastAsianWidth.txt
diff --git a/src/common/unicode/generate-unicode_combining_table.pl b/src/common/unicode/generate-unicode_nonspacing_table.pl
similarity index 68%
rename from src/common/unicode/generate-unicode_combining_table.pl
rename to src/common/unicode/generate-unicode_nonspacing_table.pl
index 8177c20260..ba97b936b4 100644
--- a/src/common/unicode/generate-unicode_combining_table.pl
+++ b/src/common/unicode/generate-unicode_nonspacing_table.pl
@@ -15,9 +15,9 @@ my $prev_codepoint;
 my $count = 0;
 print
-  "/* generated by src/common/unicode/generate-unicode_combining_table.pl, do not edit */\n\n";
+  "/* generated by src/common/unicode/generate-unicode_nonspacing_table.pl, do not edit */\n\n";
-print "static const struct mbinterval combining[] = {\n";
+print "static const struct mbinterval nonspacing[] = {\n";
 foreach my $line (<ARGV>)
 {
@@ -25,9 +25,9 @@ foreach my $line (<ARGV>)
     my @fields = split ';', $line;
     $codepoint = hex $fields[0];
-    if ($fields[2] eq 'Me' || $fields[2] eq 'Mn')
+    if ($fields[2] eq 'Me' || $fields[2] eq 'Mn' || $fields[2] eq 'Cf')
     {
-        # combining character, save for start of range
+        # non-spacing character, save for start of range
         if (!defined($range_start))
         {
             $range_start = $codepoint;
@@ -35,7 +35,7 @@ foreach my $line (<ARGV>)
     }
     else
     {
-        # not a combining character, print out previous range if any
+        # not a non-spacing character, print out previous range if any
         if (defined($range_start))
         {
             printf "\t{0x%04X, 0x%04X},\n", $range_start, $prev_codepoint;
diff --git a/src/common/wchar.c b/src/common/wchar.c
index fa8854d9e9..3b3fc53009 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -620,7 +620,7 @@ mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
  *    value of -1.
  *
  *  - Non-spacing and enclosing combining characters (general
- *    category code Mn or Me in the Unicode database) have a
+ *    category code Mn, Me or Cf in the Unicode database) have a
  *    column width of 0.
  *
  *  - Spacing characters in the East Asian Wide (W) or East Asian
@@ -638,7 +638,7 @@ mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
 static int
 ucs_wcwidth(pg_wchar ucs)
 {
-#include "common/unicode_combining_table.h"
+#include "common/unicode_nonspacing_table.h"
 #include "common/unicode_east_asian_fw_table.h"
     /* test for 8-bit control characters */
@@ -657,8 +657,8 @@ ucs_wcwidth(pg_wchar ucs)
      * factor for display width leads to the correct behavior, so do that
      * search first.
      */
-    if (mbbisearch(ucs, combining,
-                   sizeof(combining) / sizeof(struct mbinterval) - 1))
+    if (mbbisearch(ucs, nonspacing,
+                   sizeof(nonspacing) / sizeof(struct mbinterval) - 1))
         return 0;
     /* binary search in table of wide characters */
diff --git a/src/include/common/unicode_combining_table.h b/src/include/common/unicode_nonspacing_table.h
similarity index 91%
rename from src/include/common/unicode_combining_table.h
rename to src/include/common/unicode_nonspacing_table.h
index de1eab3a95..7605712abc 100644
--- a/src/include/common/unicode_combining_table.h
+++ b/src/include/common/unicode_nonspacing_table.h
@@ -1,6 +1,7 @@
-/* generated by src/common/unicode/generate-unicode_combining_table.pl, do not edit */
+/* generated by src/common/unicode/generate-unicode_nonspacing_table.pl, do not edit */
-static const struct mbinterval combining[] = {
+static const struct mbinterval nonspacing[] = {
+ {0x00AD, 0x00AD},
{0x0300, 0x036F},
{0x0483, 0x0489},
{0x0591, 0x05BD},
@@ -8,13 +9,16 @@ static const struct mbinterval combining[] = {
{0x05C1, 0x05C2},
{0x05C4, 0x05C5},
{0x05C7, 0x05C7},
+ {0x0600, 0x0605},
{0x0610, 0x061A},
+ {0x061C, 0x061C},
{0x064B, 0x065F},
{0x0670, 0x0670},
- {0x06D6, 0x06DC},
+ {0x06D6, 0x06DD},
{0x06DF, 0x06E4},
{0x06E7, 0x06E8},
{0x06EA, 0x06ED},
+ {0x070F, 0x070F},
{0x0711, 0x0711},
{0x0730, 0x074A},
{0x07A6, 0x07B0},
@@ -25,9 +29,8 @@ static const struct mbinterval combining[] = {
{0x0825, 0x0827},
{0x0829, 0x082D},
{0x0859, 0x085B},
- {0x0898, 0x089F},
- {0x08CA, 0x08E1},
- {0x08E3, 0x0902},
+ {0x0890, 0x089F},
+ {0x08CA, 0x0902},
{0x093A, 0x093A},
{0x093C, 0x093C},
{0x0941, 0x0948},
@@ -114,8 +117,7 @@ static const struct mbinterval combining[] = {
{0x17C6, 0x17C6},
{0x17C9, 0x17D3},
{0x17DD, 0x17DD},
- {0x180B, 0x180D},
- {0x180F, 0x180F},
+ {0x180B, 0x180F},
{0x1885, 0x1886},
{0x18A9, 0x18A9},
{0x1920, 0x1922},
@@ -152,6 +154,9 @@ static const struct mbinterval combining[] = {
{0x1CF4, 0x1CF4},
{0x1CF8, 0x1CF9},
{0x1DC0, 0x1DFF},
+ {0x200B, 0x200F},
+ {0x202A, 0x202E},
+ {0x2060, 0x206F},
{0x20D0, 0x20F0},
{0x2CEF, 0x2CF1},
{0x2D7F, 0x2D7F},
@@ -196,6 +201,8 @@ static const struct mbinterval combining[] = {
{0xFB1E, 0xFB1E},
{0xFE00, 0xFE0F},
{0xFE20, 0xFE2F},
+ {0xFEFF, 0xFEFF},
+ {0xFFF9, 0xFFFB},
{0x101FD, 0x101FD},
{0x102E0, 0x102E0},
{0x10376, 0x1037A},
@@ -213,7 +220,8 @@ static const struct mbinterval combining[] = {
{0x1107F, 0x11081},
{0x110B3, 0x110B6},
{0x110B9, 0x110BA},
- {0x110C2, 0x110C2},
+ {0x110BD, 0x110BD},
+ {0x110C2, 0x110CD},
{0x11100, 0x11102},
{0x11127, 0x1112B},
{0x1112D, 0x11134},
@@ -281,15 +289,16 @@ static const struct mbinterval combining[] = {
{0x11D95, 0x11D95},
{0x11D97, 0x11D97},
{0x11EF3, 0x11EF4},
+ {0x13430, 0x13438},
{0x16AF0, 0x16AF4},
{0x16B30, 0x16B36},
{0x16F4F, 0x16F4F},
{0x16F8F, 0x16F92},
{0x16FE4, 0x16FE4},
{0x1BC9D, 0x1BC9E},
- {0x1CF00, 0x1CF46},
+ {0x1BCA0, 0x1CF46},
{0x1D167, 0x1D169},
- {0x1D17B, 0x1D182},
+ {0x1D173, 0x1D182},
{0x1D185, 0x1D18B},
{0x1D1AA, 0x1D1AD},
{0x1D242, 0x1D244},
@@ -304,5 +313,5 @@ static const struct mbinterval combining[] = {
{0x1E2EC, 0x1E2EF},
{0x1E8D0, 0x1E8D6},
{0x1E944, 0x1E94A},
- {0xE0100, 0xE01EF},
+ {0xE0001, 0xE01EF},
};
--
2.31.1
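For readers who want to see how a sorted interval table like the one in this patch turns into a width decision, here is a minimal standalone C sketch. It is not the wchar.c code: `struct cp_interval`, `in_intervals` and `excerpt_width` are hypothetical names, the table holds only a handful of entries copied from the patched header, and every code point outside the excerpt is simply assumed to be one column wide.

```c
#include <stddef.h>
#include <stdint.h>

struct cp_interval
{
    uint32_t    first;
    uint32_t    last;
};

/* A few entries copied from the patched table; the real one is much longer. */
static const struct cp_interval nonspacing_excerpt[] = {
    {0x00AD, 0x00AD},
    {0x0300, 0x036F},
    {0x200B, 0x200F},
    {0x202A, 0x202E},
    {0x2060, 0x206F},
    {0x20D0, 0x20F0},
    {0xFEFF, 0xFEFF},
};

/* Binary search over sorted, non-overlapping intervals; 1 if cp is inside one. */
static int
in_intervals(uint32_t cp, const struct cp_interval *table, size_t n)
{
    size_t      lo = 0;
    size_t      hi = n;         /* search the half-open range [lo, hi) */

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (cp > table[mid].last)
            lo = mid + 1;
        else if (cp < table[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Simplified width: 0 for code points in the excerpt, 1 for everything else. */
static int
excerpt_width(uint32_t cp)
{
    size_t      n = sizeof(nonspacing_excerpt) / sizeof(nonspacing_excerpt[0]);

    return in_intervals(cp, nonspacing_excerpt, n) ? 0 : 1;
}
```

With this excerpt, `excerpt_width(0x200E)` is 0 while `excerpt_width('A')` is 1, which is the behavior the patch restores for LEFT-TO-RIGHT MARK.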
On 2022-Sep-02, Kyotaro Horiguchi wrote:
UnicodeData.txt
174:00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;Soft-hyphen seems like not zero-width.. usually...
Soft-hyphen *is* zero width. It should not be displayed. It's just a
marker so that typesetting software knows where to add real hyphens in
case a word is too long to appear in a single line.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
On Fri, Sep 2, 2022 at 3:19 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Fri, 2 Sep 2022 13:43:50 +0700, John Naylor <john.naylor@enterprisedb.com> wrote in
If there is any doubt about including all of Cf, we could also just
add a branch in wchar.c to hard-code the 200B-200F range.
If every way has defect to the similar extent, I think we will choose
to use authoritative data at least for the first step. We might want
to have additional filtering on it but it would be another issue,
maybe.
Attached is the first cut of that. (The commit messages is not great,
though.)
Okay, the patch looks good to me overall. Comparing releases, some
other ranges were in v11 but left out in v12 with the transition to
using a script:
0x070F
{0x200B, 0x200F}
{0x202A, 0x202E}
{0x206A, 0x206F}
0xFEFF
{0xFFF9, 0xFFFB}
Does anyone want to advocate for backpatching these missing ranges to
v12 and up? v12 still has a table in-line so trivial to remedy, but
v13 and up use a script, so these exceptions would likely have to use
hard-coded branches to keep from bringing in new changes.
If so, does anyone want to advocate for including this patch in v15?
It claims Unicode 14.0.0, and this would make that claim more
technically correct as well as avoiding additional branches.
--
John Naylor
EDB: http://www.enterprisedb.com
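As a quick cross-check of the list above against the v1 patch, the sketch below tests whether each range that was present in v11 is contained in one of the ranges the patch adds. The `patch_added` entries are copied from the "+" lines of the table diff; `struct cp_range` and `count_covered` are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

struct cp_range
{
    uint32_t    first;
    uint32_t    last;
};

/* Ranges reported as present in v11 but missing from v12. */
static const struct cp_range v11_ranges[] = {
    {0x070F, 0x070F}, {0x200B, 0x200F}, {0x202A, 0x202E},
    {0x206A, 0x206F}, {0xFEFF, 0xFEFF}, {0xFFF9, 0xFFFB},
};

/* Matching entries taken from the "+" lines of the v1 patch's table. */
static const struct cp_range patch_added[] = {
    {0x070F, 0x070F}, {0x200B, 0x200F}, {0x202A, 0x202E},
    {0x2060, 0x206F}, {0xFEFF, 0xFEFF}, {0xFFF9, 0xFFFB},
};

/* Count how many v11 ranges lie entirely inside some patch-added range. */
static int
count_covered(void)
{
    size_t      nv = sizeof(v11_ranges) / sizeof(v11_ranges[0]);
    size_t      np = sizeof(patch_added) / sizeof(patch_added[0]);
    int         covered = 0;

    for (size_t i = 0; i < nv; i++)
    {
        for (size_t j = 0; j < np; j++)
        {
            if (patch_added[j].first <= v11_ranges[i].first &&
                v11_ranges[i].last <= patch_added[j].last)
            {
                covered++;
                break;
            }
        }
    }
    return covered;
}
```

All six v11 ranges are covered; the only one that is not an exact match is 206A-206F, which falls inside the wider 2060-206F entry.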
On Thu, Sep 8, 2022 at 7:39, John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Sep 2, 2022 at 3:19 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Fri, 2 Sep 2022 13:43:50 +0700, John Naylor <john.naylor@enterprisedb.com> wrote in
If there is any doubt about including all of Cf, we could also just
add a branch in wchar.c to hard-code the 200B-200F range.
If every way has defect to the similar extent, I think we will choose
to use authoritative data at least for the first step. We might want
to have additional filtering on it but it would be another issue,
maybe.
Attached is the first cut of that. (The commit messages is not great,
though.)
Okay, the patch looks good to me overall. Comparing releases, some
other ranges were in v11 but left out in v12 with the transition to
using a script:
0x070F
{0x200B, 0x200F}
{0x202A, 0x202E}
{0x206A, 0x206F}
0xFEFF
{0xFFF9, 0xFFFB}
Does anyone want to advocate for backpatching these missing ranges to
v12 and up? v12 still has a table in-line so trivial to remedy, but
v13 and up use a script, so these exceptions would likely have to use
hard-coded branches to keep from bringing in new changes.
If so, does anyone want to advocate for including this patch in v15?
It claims Unicode 14.0.0, and this would make that claim more
technically correct as well as avoiding additional branches.
I think it can be fixed just in v15 and master. This issue has no impact
on SQL.
Thank you for fixing this issue
Regards
Pavel
On Thu, Sep 8, 2022 at 12:51 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
Does anyone want to advocate for backpatching these missing ranges to
v12 and up? v12 still has a table in-line so trivial to remedy, but
v13 and up use a script, so these exceptions would likely have to use
hard-coded branches to keep from bringing in new changes.
If so, does anyone want to advocate for including this patch in v15?
It claims Unicode 14.0.0, and this would make that claim more
technically correct as well as avoiding additional branches.
I think it can be fixed just in v15 and master. This issue has no impact on SQL.
Well, if the regressions from v11 are not important enough to
backpatch, there is not as much of a case to backpatch the full fix to
v15 either.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Sep 12, 2022 at 7:37, John Naylor <john.naylor@enterprisedb.com> wrote:
On Thu, Sep 8, 2022 at 12:51 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
Does anyone want to advocate for backpatching these missing ranges to
v12 and up? v12 still has a table in-line so trivial to remedy, but
v13 and up use a script, so these exceptions would likely have to use
hard-coded branches to keep from bringing in new changes.
If so, does anyone want to advocate for including this patch in v15?
It claims Unicode 14.0.0, and this would make that claim more
technically correct as well as avoiding additional branches.
I think it can be fixed just in v15 and master. This issue has no
impact on SQL.
Well, if the regressions from v11 are not important enough to
backpatch, there is not as much of a case to backpatch the full fix to
v15 either.
This is not a critical issue, really. On second thought, I don't see the
point in releasing a fresh Postgres with this bug when there is a known
bugfix, and this bugfix should be compatible (at this moment) with 16.
PostgreSQL 15 has not been released yet.
Regards
Pavel
On Mon, Sep 12, 2022 at 12:44 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
This is not a critical issue, really. On second thought, I don't see the point in releasing fresh Postgres with this bug, where there is know bugfix - and this bugfix should be compatible (at this moment) with 16.
I agree the actual logic/data change is low-risk. The patch renames
two files, which seems a bit much this late in the cycle. Maybe that's
okay, but I'd like someone else to opine before doing so.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Sep 12, 2022 at 10:29, John Naylor <john.naylor@enterprisedb.com> wrote:
On Mon, Sep 12, 2022 at 12:44 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
This is not a critical issue, really. On second thought, I don't see
the point in releasing fresh Postgres with this bug, where there is know
bugfix - and this bugfix should be compatible (at this moment) with 16.
I agree the actual logic/data change is low-risk. The patch renames
two files, which seems a bit much this late in the cycle. Maybe that's
okay, but I'd like someone else to opine before doing so.
Understood.
Pavel
On Thu, Sep 8, 2022 at 12:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Sep 2, 2022 at 3:19 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Fri, 2 Sep 2022 13:43:50 +0700, John Naylor <john.naylor@enterprisedb.com> wrote in
If there is any doubt about including all of Cf, we could also just
add a branch in wchar.c to hard-code the 200B-200F range.
If every way has defect to the similar extent, I think we will choose
to use authoritative data at least for the first step. We might want
to have additional filtering on it but it would be another issue,
maybe.
Attached is the first cut of that. (The commit messages is not great,
though.)
Okay, the patch looks good to me overall.
As discussed, I pushed to master only, with only one additional
comment in the perl script to describe Me/Mn/Cf.
--
John Naylor
EDB: http://www.enterprisedb.com