unicode match normal forms

Started by Nonamealmost 5 years ago8 messagesgeneral
Jump to latest
#1Noname
hamann.w@t-online.de

Hi,

in unicode letter ä exists in two versions - linux and windows use a composite whereas macos prefers
the decomposed form. Is there any way to make a semi-exact match that accepts both variants?
This question is not about fulltext but about matching filenames across a network - I wish to avoid two equally-looking
filenames.

Regards
Wolfgang

#2Gianni Ceccarelli
dakkar@thenautilus.net
In reply to: Noname (#1)
Re: unicode match normal forms

On 17 May 2021 13:27:40 -0000
hamann.w@t-online.de wrote:

in unicode letter ä exists in two versions - linux and windows use a
composite whereas macos prefers the decomposed form. Is there any way
to make a semi-exact match that accepts both variants?

You should probably normalise the strings in whatever application code
handles the inserting. NFC is the "usually sensible" normal form to
use.

If you can't change the application code, you may use a trigger and
apply the `normalize(text[,form])→text` function to the values

https://www.postgresql.org/docs/13/functions-string.html#id-1.5.8.10.5.2.2.7.1.1.2

something vaguely like (totally untested!)::

create function normalize_filename() returns trigger as $$
begin
new.filename := normalize(new.filename);
return new;
end;
$$ language plpgsql;

create trigger normalize_filename
before insert or update
on that_table
for each row
execute function normalize_filename();

--
Dakkar - <Mobilis in mobile>
GPG public key fingerprint = A071 E618 DD2C 5901 9574
6FE2 40EA 9883 7519 3F88
key id = 0x75193F88

#3Matthias Apitz
guru@unixarea.de
In reply to: Noname (#1)
Re: unicode match normal forms

El día lunes, mayo 17, 2021 a las 01:27:40p. m. -0000, hamann.w@t-online.de escribió:

Hi,

in unicode letter ä exists in two versions - linux and windows use a composite whereas macos prefers
the decomposed form. Is there any way to make a semi-exact match that accepts both variants?
This question is not about fulltext but about matching filenames across a network - I wish to avoid two equally-looking
filenames.

There is only *one* codepoint for the German letter a Umlaut:
LATIN SMALL LETTER A WITH DIAERESI U+00E4

Said that, having such chars (non ASCII) in file names, I count as a bad
idea.

matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
¡Con Cuba no te metas! «» Don't mess with Cuba! «» Leg Dich nicht mit Kuba an!
http://www.cubadebate.cu/noticias/2020/12/25/en-video-con-cuba-no-te-metas/

#4Gianni Ceccarelli
dakkar@thenautilus.net
In reply to: Gianni Ceccarelli (#2)
Re: unicode match normal forms

On 17 May 2021 13:27:40 -0000
hamann.w@t-online.de wrote:

in unicode letter ä exists in two versions - linux and windows use a
composite whereas macos prefers the decomposed form. Is there any
way to make a semi-exact match that accepts both variants?

Actually, re-reading your request, you want to *find* both forms?

In which case, a function index may be more useful::

create index filename_unique_normalized
on that_table(normalize(filename));

then you can search::

select *
from that_table
where normalize(filename)=?

If you want to make sure that no two rows contain "equally-looking"
filenames, you can use a unique index::

create unique index filename_unique_normalized
on that_table(normalize(filename));

(while we're on the topic of "equally-looking" characters, you may
want to look at https://en.wikipedia.org/wiki/Homoglyph and
https://www.unicode.org/reports/tr36/ )

--
Dakkar - <Mobilis in mobile>
GPG public key fingerprint = A071 E618 DD2C 5901 9574
6FE2 40EA 9883 7519 3F88
key id = 0x75193F88

#5Gianni Ceccarelli
dakkar@thenautilus.net
In reply to: Matthias Apitz (#3)
Re: unicode match normal forms

On Mon, 17 May 2021 15:45:00 +0200
Matthias Apitz <guru@unixarea.de> wrote:

There is only *one* codepoint for the German letter a Umlaut:
LATIN SMALL LETTER A WITH DIAERESI U+00E4

True. On the other hand, the sequence:

* U+0061 LATIN SMALL LETTER A
* U+0308 COMBINING DIAERESIS

will render exactly the same glyph. The two forms are closely related:
U+00E4 is in NFC (normalization form canonical composition), U+0061
U+0308 is in NFD (normalization form canonical decomposition).

See https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

--
Dakkar - <Mobilis in mobile>
GPG public key fingerprint = A071 E618 DD2C 5901 9574
6FE2 40EA 9883 7519 3F88
key id = 0x75193F88

#6Daniel Verite
daniel@manitou-mail.org
In reply to: Noname (#1)
Re: unicode match normal forms

Hamann W wrote:

in unicode letter ä exists in two versions - linux and windows use a
composite whereas macos prefers
the decomposed form. Is there any way to make a semi-exact match that
accepts both variants?

Aside from normalizing the strings into the same normal form
before comparing, non-deterministic ICU collations will recognize them as
identical (they're "canonically equivalent" in Unicode terms)

For instance,

CREATE COLLATION nd (
provider = 'icu',
locale='',
deterministic = false
);

SELECT
nfc_form,
nfd_form,
nfc_form = nfd_form COLLATE nd AS equal1,
nfc_form = nfd_form COLLATE "C" AS equal2 -- or any deterministic collation
FROM
(VALUES
(E'j\u00E4hrlich',
E'j\u0061\u0308hrlich'))
AS s(nfc_form, nfd_form);

nfc_form | nfd_form | equal1 | equal2
----------+----------+--------+--------
jährlich | jährlich | t | f
(1 row)

Normalizing is available as a built-in function since Postgres 13 and
non-deterministic collations appeared in Postgres 12.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

#7Noname
goldgraeber-werbetechnik@t-online.de
In reply to: Gianni Ceccarelli (#2)
Re: unicode match normal forms

Hi Gianni,

many thanks for your detailed response.
It turned out that my postgresql installation is too old for normalize,so I will probably
a) use an external script to normalize existing data
b) change application code to normalize data before inserting or searching

Regards
Wolfgang

Show quoted text

On 17 May 2021 13:27:40 -0000
hamann.w@t-online.de wrote:

in unicode letter ä exists in two versions - linux and windows use a
composite whereas macos prefers the decomposed form. Is there any way
to make a semi-exact match that accepts both variants?

You should probably normalise the strings in whatever application code

handles the inserting. NFC is the "usually sensible" normal form to
use.

If you can't change the application code, you may use a trigger and

apply the `normalize(text[,form])→text` function to the values

https://www.postgresql.org/docs/13/functions-string.html#id-1.5.8.10.5.2.2.7.1.1.2
something vaguely like (totally untested!)::
create function normalize_filename() returns trigger as $$

begin
new.filename := normalize(new.filename);
return new;
end;
$$ language plpgsql;

create trigger normalize_filename

before insert or update
on that_table
for each row
execute function normalize_filename();

-- >> Dakkar - <Mobilis in mobile>

GPG public key fingerprint = A071 E618 DD2C 5901 9574
6FE2 40EA 9883 7519 3F88
key id = 0x75193F88

#8Noname
goldgraeber-werbetechnik@t-online.de
In reply to: Matthias Apitz (#3)
Re: unicode match normal forms

El día lunes, mayo 17, 2021 a las 01:27:40p. m. -0000, hamann.w@t-online.de escribió:

Hi,

in unicode letter ä exists in two versions - linux and windows use a composite whereas macos prefers

the decomposed form. Is there any way to make a semi-exact match that accepts both variants?
This question is not about fulltext but about matching filenames across a network - I wish to avoid two equally-looking
filenames.

There is only *one* codepoint for the German letter a Umlaut:

LATIN SMALL LETTER A WITH DIAERESI U+00E4

Hi Matthias,

unfortunately there also is letter a with combining dieretic - and it is used by MacOS
The mac seems to prefer decomposed characters in other contexts as well, so in my
everyday job I used to have fun with product catalogues from a few companies.
Depending on the computer used for adding / editing a productthe relevant field could be
iso-latin-1, utf8 normal, or utf8 decomposed

Said that, having such chars (non ASCII) in file names, I count as a bad
idea.

I usually try to avoid whitespace and accented charactersin filenames, to be able to use ssh and scp
without much hassle, but I am not the user in this case.

Now, if I look at a music collection (stored as folders with mp3 files for the tracks), I would really prefer
"Einstürzende Neubauten" over Einstuerzende_Neubauten

Regards
Wolfgang

Show quoted text