BUG #15548: Unaccent does not remove combining diacritical characters
The following bug has been logged on the website:
Bug reference: 15548
Logged by: Hugh Ranalli
Email address: hugh@whtc.ca
PostgreSQL version: 11.1
Operating system: Ubuntu 18.04
Description:
Apparently Unicode has two ways of accenting a character: as a separate code
point, which represents the base character and the accent, or as a
"combining diacritical mark"
(https://en.wikipedia.org/wiki/Combining_Diacritical_Marks), in which case
the mark applies itself to the preceding character. For example, A followed
by U+0300 displays À. However, unaccent is not removing these accents.
SELECT unaccent(U&'A\0300'); should result in 'A', but instead results in
'À'. I'm running PostgreSQL 11.1, installed from the PostgreSQL
repositories. I've read bug report #13440, and have tried with both the
installed unaccent.rules as well as a new set generated by the
generate_unaccent_rules.py distributed with the 11.1 source code:
wget http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt
wget https://www.unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml
python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules
I see there have been some updates to generate_unaccent_rules.py to handle
Greek and Vietnamese characters, but neither of them seems to address this
issue. I'm happy to contribute a patch to handle these cases, but of course
wanted to make sure this is desired behaviour, or if I am misunderstanding
something somewhere.
Thank you,
Hugh Ranalli
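The two Unicode forms described in the report can be illustrated with Python's standard unicodedata module (an editor's sketch for illustration, not part of the original report):

```python
import unicodedata

composed = "\u00c0"     # U+00C0 LATIN CAPITAL LETTER A WITH GRAVE (one code point)
decomposed = "A\u0300"  # 'A' followed by U+0300 COMBINING GRAVE ACCENT

# The strings render identically but are different code point sequences.
print(composed == decomposed)                                # False
print(unicodedata.name(decomposed[1]))                       # COMBINING GRAVE ACCENT
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```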
PG Bug reporting form wrote:
Apparently Unicode has two ways of accenting a character: as a separate code
point, which represents the base character and the accent, or as a
"combining diacritical mark"
(https://en.wikipedia.org/wiki/Combining_Diacritical_Marks)
Yes. See also https://en.wikipedia.org/wiki/Unicode_equivalence
In general, PostgreSQL leaves it to applications to normalize
Unicode strings so that they are all in the same canonical form,
either composed or decomposed.
the mark applies itself to the preceding character. For example, A
followed by U+0300 displays À. However, unaccent is not removing
these accents.
Short of having the input normalized by the application, ISTM that the
best solution would be to provide functions to do it in Postgres, so
you'd just write for example:
unaccent(unicode_NFC(string))
Otherwise unaccent.rules can be customized. You may add replacements
for letter+diacritical sequences that are missing for the languages
you have to deal with. But doing it in general for all diacriticals
multiplied by all base characters seems unrealistic.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
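The unicode_NFC function suggested above is only a proposal, so until something like it exists an application can pre-normalize on the client side. A Python sketch of the idea (function name is illustrative):

```python
import unicodedata

def normalize_for_unaccent(s: str) -> str:
    """Recompose combining sequences into precomposed code points (NFC),
    which the stock unaccent.rules already knows how to strip."""
    return unicodedata.normalize("NFC", s)

s = "A\u0300"                  # base letter plus combining grave accent
t = normalize_for_unaccent(s)
print(len(s), len(t))          # 2 1  (two code points collapse into U+00C0)
```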
"Daniel Verite" <daniel@manitou-mail.org> writes:
PG Bug reporting form wrote:
... For example, A
followed by U+0300 displays À. However, unaccent is not removing
these accents.
Short of having the input normalized by the application, ISTM that the
best solution would be to provide functions to do it in Postgres, so
you'd just write for example:
unaccent(unicode_NFC(string))
That might be worthwhile, but it seems independent of this issue.
Otherwise unaccent.rules can be customized. You may add replacements
for letter+diacritical sequences that are missing for the languages
you have to deal with. But doing it in general for all diacriticals
multiplied by all base characters seems unrealistic.
Hm, I thought the OP's proposal was just to make unaccent drop
combining diacriticals independently of context, which'd avoid the
combinatorial-growth problem.
regards, tom lane
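The context-independent behaviour described here can be sketched in Python (illustration only; unaccent itself works by matching characters against its rules file, not by running code like this):

```python
import unicodedata

def strip_combining(s: str) -> str:
    # Decompose first so precomposed letters also shed their accents,
    # then drop every combining mark regardless of its base character.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_combining("A\u0300"))  # A   (combining form)
print(strip_combining("\u00c0"))   # A   (precomposed form)
```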
Tom Lane wrote:
Hm, I thought the OP's proposal was just to make unaccent drop
combining diacriticals independently of context, which'd avoid the
combinatorial-growth problem.
In that case, this could be achieved by simply appending the
diacriticals themselves to unaccent.rules, since replacement of a
string by an empty string is already supported as a rule.
It doesn't seem like the current file has any of these, but from
https://www.postgresql.org/docs/11/unaccent.html :
"Alternatively, if only one character is given on a line, instances
of that character are deleted; this is useful in languages where
accents are represented by separate characters"
Incidentally we may want to improve this bit of doc to mention
explicitly the Unicode decomposed forms as a use case for
removing characters. In fact I wonder if that's not what it's
already trying to express, but confusing "languages" with "forms".
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
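Appending such single-character deletion rules for the Combining Diacritical Marks block (U+0300–U+036F) could be scripted; a sketch, with the range and output format purely illustrative:

```python
# One bare character per line means "delete this character" in
# unaccent.rules, per the documentation quoted above.
rule_lines = [chr(cp) for cp in range(0x0300, 0x0370)]
print(len(rule_lines))  # one candidate deletion rule per combining mark
print("\n".join(f"U+{ord(c):04X}" for c in rule_lines[:3]))
```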
On Thu, 13 Dec 2018, 11:26 Daniel Verite <daniel@manitou-mail.org> wrote:
Tom Lane wrote:
Hm, I thought the OP's proposal was just to make unaccent drop
combining diacriticals independently of context, which'd avoid the
combinatorial-growth problem.
That's what I was thinking. Given that the accent is separate from the
characters, simply dropping it should result in the correct unaccented
character.
In that case, this could be achieved by simply appending the
diacriticals themselves to unaccent.rules, since replacement of a
string by an empty string is already supported as a rule.
It doesn't seem like the current file has any of these, but from
https://www.postgresql.org/docs/11/unaccent.html :
"Alternatively, if only one character is given on a line, instances
of that character are deleted; this is useful in languages where
accents are represented by separate characters"
Yes, I had read that in the docs, and that's the approach I planned to
take. I'll go ahead and develop a patch, then.
Best wishes,
Hugh
I've attached a patch that removes combining diacriticals. As with Latin and
Greek letters, it uses ranges to restrict its activity.
I have not submitted a patch for unaccent.rules, as it seems that a rules
file generated from generate_unaccent_rules.py actually drops a large
number of rules (even before my changes), such as the one replacing the
copyright symbol © with (C), along with rules for other accented characters. It's probably
worth asking if the shipped unaccent.rules should correspond to what the
shipped generation utility produces, or not. I was surprised to see that it
didn't.
Please let me know if you see anything I need to change.
Best wishes,
Hugh
--
Hugh Ranalli
Principal Consultant
White Horse Technology Consulting
e: hugh@whtc.ca
c: +01-416-994-7957
w: www.whtc.ca
Attachments:
remove-combining-diacritical-accents-in-unaccent.rules.patch (text/x-patch, +30 −1)
Hugh Ranalli <hugh@whtc.ca> writes:
I've attached a patch that removes combining diacriticals. As with Latin and
Greek letters, it uses ranges to restrict its activity.
Cool. Please add it to the current CF so we don't forget about it:
https://commitfest.postgresql.org/21/
I have not submitted a patch for unaccent.rules, as it seems that a rules
file generated from generate_unaccent_rules.py will actually remove a large
number of rules (even before my changes), such as replacing the copyright
symbol © with (C), as well as other accented characters. It's probably
worth asking if the shipped unaccent.rules should correspond to what the
shipped generation utility produces, or not. I was surprised to see that it
didn't.
Me too -- seems like that bears looking into. Perhaps the script's
results are platform dependent -- what were you testing on?
regards, tom lane
On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hugh Ranalli <hugh@whtc.ca> writes:
Cool. Please add it to the current CF so we don't forget about it:
https://commitfest.postgresql.org/21/
Done.
Me too -- seems like that bears looking into. Perhaps the script's
results are platform dependent -- what were you testing on?
I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
that's it. The program's decisions come from the two data files, the
Unicode data set and the Latin-ASCII transliteration file. The script uses
categories (
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
to identify letters (and now combining marks) and if they are in range,
performs a substitution. It then uses the transliteration file to find
rules for particular character substitutions (for example, that file seems
to handle the copyright symbol substitution). I don't see anything platform
dependent in there.
In looking more closely, I also see that the script isn't generating ligatures,
even though it should, because although the program can generate them, none
of the ligatures are in the ranges defined in PLAIN_LETTER_RANGES, and so
they are skipped.
This could probably be handled by adding the ligature ranges to the defined
ranges. Symbol types could be added to the types it looks at, and perhaps
the codepoint ranges collapsed into one list, as the IDs are unique across
all categories. I don't think we'd want to just rely on ranges, as that
could include control characters, punctuation, etc.
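The general-category selection the script performs can be previewed with unicodedata (the values shown come straight from the Unicode Character Database; note that ligatures are ordinary letters, which is why range checks alone decide their fate):

```python
import unicodedata

# "Lu"/"Ll" = upper/lowercase letter, "Mn" = nonspacing (combining) mark,
# "So" = other symbol -- the kinds of categories being discussed here.
for cp in (0x00C0, 0x0300, 0x00A9, 0xFB01):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch):2} {unicodedata.name(ch)}")
```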
There are a number of other characters that appear in unaccent.rules that
aren't generated by the script. I've attached a diff of the output of
generate_unaccent_rules (using the version before my changes, to simplify
matters) and unaccent.rules. Unfortunately, I don't know how to interpret
most of these characters.
I suppose it's valid to ask if changing © to (C) is even something an
"unaccent" function should do. Given that it's in the existing rules file,
should it be supported as an existing behaviour?
Sorry for more questions than answers. ;-)
Attachments:
unaccent.diff (text/x-patch, +0 −650)
Hugh Ranalli <hugh@whtc.ca> writes:
On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Me too -- seems like that bears looking into. Perhaps the script's
results are platform dependent -- what were you testing on?
I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
that's it. The program's decisions come from the two data files, the
Unicode data set and the Latin-ASCII transliteration file. The script uses
categories (
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
to identify letters (and now combining marks) and if they are in range,
performs a substitution. It then uses the transliteration file to find
rules for particular character substitutions (for example, that file seems
to handle the copyright symbol substitution). I don't see anything platform
dependent in there.
Hm. Something funny is going on here. When I fetch the two reference
files from the URLs cited in the script, and do
python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules
I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.
I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.
regards, tom lane
I wrote:
... I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.
I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.
A few minutes later on a Fedora 28 box: python 2.7.15 also gives me the
expected results, while python 3.6.6 fails with "SyntaxError: invalid
syntax".
So updating that script to also work with python3 might be a worthwhile
TODO item. But I'm at a loss to explain why you get different results.
regards, tom lane
On Sat, 15 Dec 2018 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hm. Something funny is going on here. When I fetch the two reference
files from the URLs cited in the script, and do
python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
--latin-ascii-file Latin-ASCII.xml >newrules
I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.
I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.
Well, that's embarrassing. When I looked I couldn't see anything that
looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17.
We use other versions of 2.7 on our production platforms. I'll take another
look, and check the URLs I am using.
On Sat, 15 Dec 2018 at 14:05, Hugh Ranalli <hugh@whtc.ca> wrote:
The problem is that I downloaded the latest version of the Latin-ASCII
transliteration file (r34 rather than the r28 specified in the URL). Over 3
years ago (in r29, of course) they changed the file format (
https://unicode.org/cldr/trac/ticket/5873) so that
parse_cldr_latin_ascii_transliterator loads an empty rules set. I'd be
happy to either a) support both formats, or b) support just the newest and
update the URL. Option b) is cleaner, and I can't imagine why anyone would
want to use an older rule set (then again, struggling with Unicode always
makes my head hurt; I am not an expert on it). Thoughts?
Hugh Ranalli <hugh@whtc.ca> writes:
The problem is that I downloaded the latest version of the Latin-ASCII
transliteration file (r34 rather than the r28 specified in the URL). Over 3
years ago (in r29, of course) they changed the file format (
https://unicode.org/cldr/trac/ticket/5873) so that
parse_cldr_latin_ascii_transliterator loads an empty rules set.
Ah-hah.
I'd be
happy to either a) support both formats, or b), support just the newest and
update the URL. Option b) is cleaner, and I can't imagine why anyone would
want to use an older rule set (then again, struggling with Unicode always
makes my head hurt; I am not an expert on it). Thoughts?
(b) seems sufficient to me, but perhaps someone else has a different
opinion.
Whichever we do, I think it should be a separate patch from the feature
addition for combining diacriticals, just to keep the commit history
clear.
regards, tom lane
On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
+1 for updating to the latest file from time to time. After
http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
our special_cases() function will have just the two Cyrillic
characters, which should almost certainly be handled by adding
Cyrillic to the ranges we handle via the usual code path, and DEGREE
CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be
extracted from UnicodeData.txt (or we could just forget about them), and
then we could drop special_cases().
--
Thomas Munro
http://www.enterprisedb.com
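The degree signs mentioned above do carry compatibility decompositions in UnicodeData.txt, so they could in principle be derived rather than special-cased; a Python illustration:

```python
import unicodedata

# U+2103 DEGREE CELSIUS and U+2109 DEGREE FAHRENHEIT have <compat>
# decompositions to DEGREE SIGN plus an ASCII letter.
print(unicodedata.normalize("NFKD", "\u2103"))  # °C
print(unicodedata.normalize("NFKD", "\u2109"))  # °F
```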
On Sat, 15 Dec 2018 at 21:26, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Well, when I modified the code to handle the new version of the
transliteration file, I discovered that was sufficient to handle the old
version as well. That's not the way things usually go, but I'll take it. ;-)
I've attached two patches, one to update generate_unaccent_rules.py, and
another that updates unaccent.rules from the v34 transliteration file. I'll
be happy to add these to the CF. Does anyone need to review them and give
me approval before I do so?
Best wishes,
Hugh
Hugh Ranalli <hugh@whtc.ca> writes:
I've attached two patches, one to update generate_unaccent_rules.py, and
another that updates unaccent.rules from the v34 transliteration file.
I think you forgot the patches?
I'll
be happy to add these to the CF. Does anyone need to review them and give
me approval before I do so?
Nope.
regards, tom lane
On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hugh Ranalli <hugh@whtc.ca> writes:
I've attached two patches, one to update generate_unaccent_rules.py, and
another that updates unaccent.rules from the v34 transliteration file.
I think you forgot the patches?
Sigh, yes I did. That's what I get for trying to get this sent out before
heading to an appointment. Patches attached and will add to CF. Let me know
if you see anything amiss.
Hugh
Attachments:
unaccent.rules-update-to-Latin-ASCII-CDLR-v34.patch (text/x-patch, +15 −0)
generate_unaccent_rules-handle-all-Latin-ASCII-versions.patch (text/x-patch, +10 −5)
On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh@whtc.ca> wrote:
+ʹ '
+ʺ "
+ʻ '
+ʼ '
+ʽ '
+˂ <
+˃ >
+˄ ^
+ˆ ^
+ˈ '
+ˋ `
+ː :
+˖ +
+˗ -
+˜ ~
I don't think this is quite right. Those don't seem to be the
combining codepoints[1], and in any case they are being replaced with
ASCII characters, whereas I thought we wanted to replace them with
nothing at all. Here is my attempt to come up with a test case using
combining characters:
select unaccent('un café crème s''il vous plaît');
It's not stripping the accents. I've attached that in a file for
reference so you can run it with psql -f x.sql, and you can see that
it's using combining code points (code points 0301, 0300, 0302 which
come out as cc81, cc80, cc82 in UTF-8) like so:
$ xxd x.sql
00000000: 7365 6c65 6374 2075 6e61 6363 656e 7428 select unaccent(
00000010: 2775 6e20 6361 6665 cc81 2063 7265 cc80 'un cafe.. cre..
00000020: 6d65 2073 2727 696c 2076 6f75 7320 706c me s''il vous pl
00000030: 6169 cc82 7427 293b 0a0a ai..t');..
(To come up with that I used the trick of typing ":%!xxd" and then
when finished ":%!xxd -r", to turn vim into a hex editor.)
[1]: https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
x.sql
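The UTF-8 byte sequences shown in the hexdump can be cross-checked in Python:

```python
# Combining acute, grave and circumflex (U+0301, U+0300, U+0302)
# encode to the two-byte UTF-8 sequences visible in the hexdump.
for mark in ("\u0301", "\u0300", "\u0302"):
    print(f"U+{ord(mark):04X} -> {mark.encode('utf-8').hex()}")
```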
On Tue, Dec 18, 2018 at 3:05 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Oh, I see now that that was just the v34 ASCII transliteration update,
and perhaps the diacritic stripping will be posted separately.
--
Thomas Munro
http://www.enterprisedb.com
On Tue, Dec 18, 2018 at 03:05:00PM +1100, Thomas Munro wrote:
Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
the same time? That would make it easy to check the extent of the
patches proposed on this thread.
--
Michael