Unaccent characters

Started by Thom Brown · almost 14 years ago · 2 messages · general
#1 Thom Brown
thom@linux.com

Hi,

I had a look at the unaccent.rules file and noticed the following
characters aren't properly converted:

ß (U+00DF) An eszett represents a double-s "SS" but this replaces it
with one "S". Shouldn't this be replaced with "SS"?

Æ (U+00C6) and æ (U+00E6) These don't have an accent, diacritic or
anything added to a single Latin character. It's simply a ligature of
"A" and "E" or "a" and "e". If someone has the text "æther", I would
imagine they'd be surprised at it being converted to "ather" instead
of "aether".

Œ (U+0152) and œ (U+0153). Same as above. This is a ligature of "O"
and "E" or "o" and "e". Except this time the unaccent module chooses
the 2nd character instead of the 1st which is confusing.

If these were properly converted it would change the length of the
text, so I'm wondering if that's the reason for not properly
converting them. Could someone elaborate?
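
To make the length concern concrete, here is a minimal Python sketch. It is
not the unaccent module's code: the RULES table and the expand() helper are
hypothetical stand-ins for rules-file entries, assuming each source character
may map to a replacement of more than one character.

```python
# Hypothetical rule table (not taken from unaccent.rules): each source
# character maps to a replacement string that may be longer than one character.
RULES = {
    "ß": "ss",
    "Æ": "AE",
    "æ": "ae",
    "Œ": "OE",
    "œ": "oe",
}


def expand(text: str) -> str:
    """Replace every character listed in RULES; leave all others untouched."""
    return "".join(RULES.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Each of these words grows by one character after conversion.
    for word in ("æther", "Straße", "œuvre"):
        converted = expand(word)
        print(f"{word} -> {converted} (length {len(word)} -> {len(converted)})")
```

With a table like this, "æther" becomes "aether", so any caller that assumes
the output is the same length as the input would need to allow for growth.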

--
Thom

#2 Peter Eisentraut
peter_e@gmx.net
In reply to: Thom Brown (#1)
Re: Unaccent characters

On Fri, 2012-04-20 at 09:15 +0100, Thom Brown wrote:

> I had a look at the unaccent.rules file and noticed the following
> characters aren't properly converted:

> ß (U+00DF) An eszett represents a double-s "SS" but this replaces it
> with one "S". Shouldn't this be replaced with "SS"?

Probably, but it certainly shouldn't be upper case.

> Æ (U+00C6) and æ (U+00E6) These don't have an accent, diacritic or
> anything added to a single Latin character. It's simply a ligature of
> "A" and "E" or "a" and "e". If someone has the text "æther", I would
> imagine they'd be surprised at it being converted to "ather" instead
> of "aether".

It depends on what the point of this module is supposed to be. Doing
"unaccenting" usefully depends on language and context. For example, it
would be very reasonable to map æ to ae, but in a Scandinavian context,
æ is equivalent to ä, which is mapped to a, which is itself
questionable.
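
One way to picture the language dependence is a base table plus per-context
overrides. The Python sketch below is only an illustration under assumptions:
BASE_RULES, CONTEXT_OVERRIDES, the "scandinavian" key and strip_marks() are
all hypothetical names, and nothing here reflects how the unaccent module is
actually configured.

```python
from typing import Optional

# Hypothetical default table: æ is expanded as a ligature, ä loses its diaeresis.
BASE_RULES = {"ä": "a", "Ä": "A", "æ": "ae", "Æ": "AE"}

# Assumed override for a Scandinavian context, where æ is treated like ä.
CONTEXT_OVERRIDES = {
    "scandinavian": {"æ": "a", "Æ": "A"},
}


def strip_marks(text: str, context: Optional[str] = None) -> str:
    """Apply BASE_RULES, letting the chosen context override individual entries."""
    rules = {**BASE_RULES, **CONTEXT_OVERRIDES.get(context, {})}
    return "".join(rules.get(ch, ch) for ch in text)
```

Under these assumed tables, strip_marks("æther") gives "aether", while
strip_marks("væg", "scandinavian") gives "vag", the same result as for "väg".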

> Œ (U+0152) and œ (U+0153). Same as above. This is a ligature of "O"
> and "E" or "o" and "e". Except this time the unaccent module chooses
> the 2nd character instead of the 1st which is confusing.

That certainly seems wrong. It's also worth noting that while æ is in
some languages considered a separate letter, œ is generally just a
typographical ligature.