collate not support Unicode Variation Selector

Started by 荒井元成over 3 years ago10 messages
#1荒井元成
n2029@ndensan.co.jp

Hi, I hope supported Unicode Variation Selector on collate.

D209007=# create table ivstesticu (
D209007(# moji text
D209007(# );
D209007=# create table ivstest (
D209007(# moji text collate "ja-x-icu" CONSTRAINT firstkey PRIMARY KEY D209007(# );
D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+0E0101' || U&'\+00304D');
D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+00304D');
D209007=# select moji from ivstest where moji like '%' || U&'\+00304B' || '%';
-------------
㐶󠄁き
㐶き
(2 行)

expected
-------------
㐶き
(1 行)

Best regards,

#2荒井元成
n2029@ndensan.co.jp
In reply to: 荒井元成 (#1)
RE: collate not support Unicode Variation Selector

Hi, I hope supported Unicode Variation Selector on collate.

I will resend it because there was a typo.

D209007=# create table ivstest ( moji text collate "ja-x-icu" CONSTRAINT firstkey PRIMARY KEY );
D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+0E0101' || U&'\+00304D');
D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+00304D');
D209007=# select moji from ivstest where moji like '%' || U&'\+003436' || '%';
-------------
㐶󠄁き
㐶き
(2 行)

expected
-------------
㐶き
(1 行)

Best regards,

#3Thomas Munro
thomas.munro@gmail.com
In reply to: 荒井元成 (#2)
Re: collate not support Unicode Variation Selector

On Wed, Aug 3, 2022 at 12:09 PM 荒井元成 <n2029@ndensan.co.jp> wrote:

D209007=# create table ivstest ( moji text collate "ja-x-icu" CONSTRAINT firstkey PRIMARY KEY );
D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+0E0101' || U&'\+00304D');
D209007=# insert into ivstest (moji) values ( U&'\+003436' || U&'\+00304D');
D209007=# select moji from ivstest where moji like '%' || U&'\+003436' || '%';
-------------
㐶󠄁き
㐶き
(2 行)

expected
-------------
㐶き
(1 行)

So you want to match only strings that contain U&'\+003436' *not*
followed by a variation selector (as we also discussed at [1]/messages/by-id/013f01d873bb$ff5f64b0$fe1e2e10$@ndensan.co.jp). I'm
pretty sure that everything in PostgreSQL considers variation
selectors to be separate characters. Perhaps it is possible to write
a regular expression covering the variation selector ranges, something
like '\U00003436[^\U000E0100-\U000E010EF]'?

Here's an example using Latin characters that are easier for me, but
show approximately the same thing, since variation selectors are a bit
like "combining" characters:

postgres=# create table t (x text);
CREATE TABLE
postgres=# insert into t values ('e'), ('ef'), ('e' || U&'\0301');
INSERT 0 3
postgres=# select * from t;
x
----
e
ef

(3 rows)

postgres=# select * from t where x ~ 'e([^\u0300-\u036f]|$)';
x
----
e
ef
(2 rows)

[1]: /messages/by-id/013f01d873bb$ff5f64b0$fe1e2e10$@ndensan.co.jp

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#3)
Re: collate not support Unicode Variation Selector

Thomas Munro <thomas.munro@gmail.com> writes:

So you want to match only strings that contain U&'\+003436' *not*
followed by a variation selector (as we also discussed at [1]). I'm
pretty sure that everything in PostgreSQL considers variation
selectors to be separate characters.

There might be something that doesn't, but LIKE certainly isn't it.
I don't believe plain LIKE is collation-aware at all, it just sees
characters to match or not match. ILIKE is a little collation-aware,
but it's still not going to consider a combining sequence as one
character. The same for the regex operators.

Maybe it would help if you run the strings through normalize() first?
I'm not sure if that can combine combining characters.

regards, tom lane

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#4)
Re: collate not support Unicode Variation Selector

On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Maybe it would help if you run the strings through normalize() first?
I'm not sure if that can combine combining characters.

I think the similarity between Latin combining characters and these
ideographic variations might end there. I don't think there is a
single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é.
This system is for controlling small differences in rendering for the
"same" character[1]http://www.unicode.org/reports/tr37/tr37-14.html. My computer doesn't even show the OP's example
glyphs as different (to my eyes, at least; I can see on a random
picture I found[2]https://glyphwiki.org/wiki/u3436 that the one with the e0101 selector is supposed to
have a ... what do you call that ... a tiny gap :-)).

[1]: http://www.unicode.org/reports/tr37/tr37-14.html
[2]: https://glyphwiki.org/wiki/u3436

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Thomas Munro (#5)
Re: collate not support Unicode Variation Selector

At Wed, 3 Aug 2022 14:02:08 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Maybe it would help if you run the strings through normalize() first?
I'm not sure if that can combine combining characters.

I think the similarity between Latin combining characters and these
ideographic variations might end there. I don't think there is a
single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é.

Right. At least in Japanese texts, the two "character"s are the same
glyph. In that sense the loss of variation selectors from a text
doesn't alter its meaning and doesn't hurt correctness at all.
Ideographic variation is useful in special cases where their
ideographic identity is crucial.

This system is for controlling small differences in rendering for the
"same" character[1]. My computer doesn't even show the OP's example
glyphs as different (to my eyes, at least; I can see on a random
picture I found[2] that the one with the e0101 selector is supposed to
have a ... what do you call that ... a tiny gap :-)).

They need variation-aware fonts and application support to render. So
when even *I* see the two characters on Excel (which I believe doesn't
have that support by default), they would look exactly same. In that
sense, my opinion on the behavior is that all ideographic variations
rather should be treated as the same character in searching in general
context. In other words, text matching should just drop variation
selectors as the default behavior.

ICU:Collator [1]https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html has the notion of "collation strength" and I saw in
an article that only Colator::IDENTICAL among five alternatives makes
distinction between ideographic variations of a glyph.

[1] http://www.unicode.org/reports/tr37/tr37-14.html
[2] https://glyphwiki.org/wiki/u3436

[1]: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7荒井元成
n2029@ndensan.co.jp
In reply to: Kyotaro Horiguchi (#6)
RE: collate not support Unicode Variation Selector

Thank you for your reply.

About 60,000 characters are registered in the IPAmj Mincho font designated by the national specifications.
It should be able to handle all characters.

regards.

-----Original Message-----
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Sent: Wednesday, August 3, 2022 3:26 PM
To: thomas.munro@gmail.com
Cc: tgl@sss.pgh.pa.us; n2029@ndensan.co.jp; pgsql-hackers@lists.postgresql.org
Subject: Re: collate not support Unicode Variation Selector

At Wed, 3 Aug 2022 14:02:08 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in

On Wed, Aug 3, 2022 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Maybe it would help if you run the strings through normalize() first?
I'm not sure if that can combine combining characters.

I think the similarity between Latin combining characters and these
ideographic variations might end there. I don't think there is a
single codepoint version of U&'\+003436' || U&'\+0E0101', unlike é.

Right. At least in Japanese texts, the two "character"s are the same glyph. In that sense the loss of variation selectors from a text doesn't alter its meaning and doesn't hurt correctness at all.
Ideographic variation is useful in special cases where their ideographic identity is crucial.

This system is for controlling small differences in rendering for the
"same" character[1]. My computer doesn't even show the OP's example
glyphs as different (to my eyes, at least; I can see on a random
picture I found[2] that the one with the e0101 selector is supposed to
have a ... what do you call that ... a tiny gap :-)).

They need variation-aware fonts and application support to render. So when even *I* see the two characters on Excel (which I believe doesn't have that support by default), they would look exactly same. In that sense, my opinion on the behavior is that all ideographic variations rather should be treated as the same character in searching in general context. In other words, text matching should just drop variation selectors as the default behavior.

ICU:Collator [1]https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html has the notion of "collation strength" and I saw in an article that only Colator::IDENTICAL among five alternatives makes distinction between ideographic variations of a glyph.

[1] http://www.unicode.org/reports/tr37/tr37-14.html
[2] https://glyphwiki.org/wiki/u3436

[1]: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: 荒井元成 (#7)
Re: collate not support Unicode Variation Selector

At Wed, 3 Aug 2022 20:12:53 +0900, 荒井元成 <n2029@ndensan.co.jp> wrote in

Thank you for your reply.

About 60,000 characters are registered in the IPAmj Mincho font designated by the national specifications.
It should be able to handle all characters.

Yeah, it is one of that fonts. But I didn't know that MS-Word can
*display* ideographic variations. But it is dissapoinging that input
requires to copy-paste from the Web.. Maybe that characters can be
input smoothly by using ATOK or alikes..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#9荒井元成
n2029@ndensan.co.jp
In reply to: Kyotaro Horiguchi (#8)
RE: collate not support Unicode Variation Selector

Thank you for your reply.

SQLServer supports Unicode Variation Selector, so I would like PostgreSQL to
support them as well.

Regards.

--
Motonari Arai

-----Original Message-----
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Sent: Thursday, August 4, 2022 5:24 PM
To: n2029@ndensan.co.jp
Cc: thomas.munro@gmail.com; tgl@sss.pgh.pa.us;
pgsql-hackers@lists.postgresql.org
Subject: Re: collate not support Unicode Variation Selector

At Wed, 3 Aug 2022 20:12:53 +0900, 荒井元成 <n2029@ndensan.co.jp> wrote in

Thank you for your reply.

About 60,000 characters are registered in the IPAmj Mincho font designated

by the national specifications.

It should be able to handle all characters.

Yeah, it is one of that fonts. But I didn't know that MS-Word can
*display* ideographic variations. But it is dissapoinging that input
requires to copy-paste from the Web.. Maybe that characters can be input
smoothly by using ATOK or alikes..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: 荒井元成 (#9)
Re: collate not support Unicode Variation Selector

At Thu, 4 Aug 2022 19:01:33 +0900, 荒井元成 <n2029@ndensan.co.jp> wrote in

Thank you for your reply.

SQLServer supports Unicode Variation Selector, so I would like PostgreSQL to
support them as well.

I studied the code a bit further, then found that simple comparison
can ignore selectors by using nondeterministic collation.

CREATE COLLATION col1 (provider=icu, locale='ja', deterministic=false);
SELECT (U&'\+003436' || U&'\+0E0101' || U&'\+00304D' collate col1) = U&'\+003436' || U&'\+00304D';
?column?
----------
t

However LIKE dislikes this.

ERROR: nondeterministic collations are not supported for LIKE

Deterministic collations assume text equality means bytewise
equality. So, the "problem" behavior is correct in a sense. In that
sense, those functions that do not support nondeterministic collations
can be implemented without considering ICU, which leads to the
"problem" behavior. ICU has regular expression function so LIKE might
be ableto be implemented using this. If it is done, and if a
non-deterministic IVS-sensitive collation is available (I didin't find
how to get one..), LIKE would work as you expect.

But..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center