BUG #15548: Unaccent does not remove combining diacritical characters

Started by PG Bug reporting form, about 7 years ago, 59 messages
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15548
Logged by: Hugh Ranalli
Email address: hugh@whtc.ca
PostgreSQL version: 11.1
Operating system: Ubuntu 18.04
Description:

Apparently Unicode has two ways of accenting a character: as a separate code
point, which represents the base character and the accent, or as a
"combining diacritical mark"
(https://en.wikipedia.org/wiki/Combining_Diacritical_Marks), in which case
the mark applies itself to the preceding character. For example, A followed
by U+0300 displays À. However, unaccent is not removing these accents.
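
The two representations can be compared outside the database. The following is a minimal Python 3 sketch, not part of the original report, that uses the standard unicodedata module; the variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00c0"  # LATIN CAPITAL LETTER A WITH GRAVE, one code point
decomposed = "A\u0300"  # "A" followed by COMBINING GRAVE ACCENT, two code points

# Both render as the same glyph, yet they are different code point sequences.
print(precomposed == decomposed)

# NFC composes the pair into one code point; NFD splits it back apart.
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)
print(composed == precomposed, split == decomposed)
```

A rule keyed on the precomposed character therefore never matches the decomposed sequence unless the input is normalized first.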

SELECT unaccent(U&'A\0300'); should result in 'A', but instead results in
'À'. I'm running PostgreSQL 11.1, installed from the PostgreSQL
repositories. I've read bug report #13440, and have tried with both the
installed unaccent.rules as well as a new set generated by the
generate_unaccent_rules.py distributed with the 11.1 source code:
wget http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt
wget https://www.unicode.org/repos/cldr/trunk/common/transforms/Latin-ASCII.xml
python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules

I see there have been some updates to generate_unaccent_rules.py to handle
Greek and Vietnamese characters, but neither of them seems to address this
issue. I'm happy to contribute a patch to handle these cases, but of course
wanted to make sure this is desired behaviour, or if I am misunderstanding
something somewhere.

Thank you,
Hugh Ranalli

#2 Daniel Verite
daniel@manitou-mail.org
In reply to: PG Bug reporting form (#1)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

PG Bug reporting form wrote:

Apparently Unicode has two ways of accenting a character: as a separate code
point, which represents the base character and the accent, or as a
"combining diacritical mark"
(https://en.wikipedia.org/wiki/Combining_Diacritical_Marks)

Yes. See also https://en.wikipedia.org/wiki/Unicode_equivalence

In general, PostgreSQL leaves it to applications to normalize
Unicode strings so that they are all in the same canonical form,
either composed or decomposed.

the mark applies itself to the preceding character. For example, A
followed by U+0300 displays À. However, unaccent is not removing
these accents.

Short of having the input normalized by the application, ISTM that the
best solution would be to provide functions to do it in Postgres, so
you'd just write for example:
unaccent(unicode_NFC(string))

Otherwise unaccent.rules can be customized. You may add replacements
for letter+diacritical sequences that are missing for the languages
you have to deal with. But doing it in general for all diacriticals
multiplied by all base characters seems unrealistic.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#2)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

"Daniel Verite" <daniel@manitou-mail.org> writes:

PG Bug reporting form wrote:

... For example, A
followed by U+0300 displays À. However, unaccent is not removing
these accents.

Short of having the input normalized by the application, ISTM that the
best solution would be to provide functions to do it in Postgres, so
you'd just write for example:
unaccent(unicode_NFC(string))

That might be worthwhile, but it seems independent of this issue.

Otherwise unaccent.rules can be customized. You may add replacements
for letter+diacritical sequences that are missing for the languages
you have to deal with. But doing it in general for all diacriticals
multiplied by all base characters seems unrealistic.

Hm, I thought the OP's proposal was just to make unaccent drop
combining diacriticals independently of context, which'd avoid the
combinatorial-growth problem.
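
As an illustration of the context-free idea (a sketch under assumptions, not code from the thread), the Python 3 snippet below drops every character whose general category is Mn, whatever base character precedes it. The function name is hypothetical.

```python
import unicodedata

def strip_combining_marks(text):
    """Drop every nonspacing mark (category Mn), regardless of what precedes it."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(strip_combining_marks("A\u0300"))     # decomposed A with grave
print(strip_combining_marks("cafe\u0301"))  # decomposed e with acute
```

Because each mark is handled on its own, the work scales with the number of marks rather than with marks multiplied by base characters. Precomposed characters are left alone by this sketch and would still need their own rules.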

regards, tom lane

#4 Daniel Verite
daniel@manitou-mail.org
In reply to: Tom Lane (#3)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Tom Lane wrote:

Hm, I thought the OP's proposal was just to make unaccent drop
combining diacriticals independently of context, which'd avoid the
combinatorial-growth problem.

In that case, this could be achieved by simply appending the
diacriticals themselves to unaccent.rules, since replacement of a
string by an empty string is already supported as a rule.
It doesn't seem like the current file has any of these, but from
https://www.postgresql.org/docs/11/unaccent.html :

"Alternatively, if only one character is given on a line, instances
of that character are deleted; this is useful in languages where
accents are represented by separate characters"
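
To make the rule format concrete, here is a simplified Python 3 sketch of how such a file could be read and applied. It is an assumption-laden illustration, not the extension's actual implementation: it matches only single-character sources, and parse_rules and apply_rules are hypothetical names.

```python
def parse_rules(lines):
    """Map each source string to its replacement; '' means delete."""
    rules = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 1:
            # A lone character on a line: delete every instance of it.
            rules[fields[0]] = ""
        elif len(fields) >= 2:
            # "source<whitespace>replacement"
            rules[fields[0]] = fields[1]
    return rules

def apply_rules(text, rules):
    """Replace each character that has a rule; keep the others as-is."""
    return "".join(rules.get(ch, ch) for ch in text)

# One ordinary rule (precomposed A-grave to A) and one deletion rule (U+0300).
rules = parse_rules(["\u00c0\tA", "\u0300"])
print(apply_rules("\u00c0 and A\u0300", rules))
```

With the deletion rule present, both the precomposed and the decomposed spelling reduce to a plain "A".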

Incidentally we may want to improve this bit of doc to mention
explicitly the Unicode decomposed forms as a use case for
removing characters. In fact I wonder if that's not what it's
already trying to express, but confusing "languages" with "forms".

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

#5 Hugh Ranalli
hugh@whtc.ca
In reply to: Daniel Verite (#4)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Thu, 13 Dec 2018, 11:26 Daniel Verite <daniel@manitou-mail.org> wrote:

Tom Lane wrote:

Hm, I thought the OP's proposal was just to make unaccent drop
combining diacriticals independently of context, which'd avoid the
combinatorial-growth problem.

That's what I was thinking. Given that the accent is separate from the
characters, simply dropping it should result in the correct unaccented
character.

In that case, this could be achieved by simply appending the
diacriticals themselves to unaccent.rules, since replacement of a
string by an empty string is already supported as a rule.
It doesn't seem like the current file has any of these, but from
https://www.postgresql.org/docs/11/unaccent.html :

"Alternatively, if only one character is given on a line, instances
of that character are deleted; this is useful in languages where
accents are represented by separate characters"

Yes, I had read that in the docs, and that's the approach I planned to
take. I'll go ahead and develop a patch, then.

Best wishes,
Hugh

#6 Hugh Ranalli
hugh@whtc.ca
In reply to: Hugh Ranalli (#5)
1 attachment(s)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

I've attached a patch that removes combining diacriticals. As with Latin and
Greek letters, it uses ranges to restrict its activity.
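
The range-restricted check can be sketched in Python 3 roughly as follows. The ranges are the ones listed in the attached patch; looking up the category with the standard unicodedata module, rather than with a parsed UnicodeData.txt as the script does, is an assumption made to keep the example self-contained.

```python
import unicodedata

# Ranges taken from the attached patch.
COMBINING_MARK_RANGES = ((0x0300, 0x0362),  # Mn: accents, IPA
                         (0x20dd, 0x20e0),  # Me: symbols
                         (0x20e2, 0x20e4))  # Me: screen, keycap, triangle

def is_mark_to_remove(code):
    """Return True if code is a combining mark inside an allowed range."""
    if unicodedata.category(chr(code)) not in ("Mn", "Me", "Mc"):
        return False
    return any(begin <= code <= end for begin, end in COMBINING_MARK_RANGES)

for code in (0x0300, 0x20dd, 0x20e1, ord("A")):
    print("U+%04X %s" % (code, is_mark_to_remove(code)))
```

U+20E1 is a combining mark but falls between the listed ranges, so it is kept. That is the point of restricting by range rather than by category alone.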

I have not submitted a patch for unaccent.rules, as it seems that a rules
file generated from generate_unaccent_rules.py will actually remove a large
number of rules (even before my changes), such as replacing the copyright
symbol © with (C), as well as other accented characters. It's probably
worth asking if the shipped unaccent.rules should correspond to what the
shipped generation utility produces, or not. I was surprised to see that it
didn't.

Please let me know if you see anything I need to change.

Best wishes,
Hugh

--
Hugh Ranalli
Principal Consultant
White Horse Technology Consulting
e: hugh@whtc.ca
c: +01-416-994-7957
w: www.whtc.ca

Attachments:

remove-combining-diacritical-accents-in-unaccent.rules.patch (text/x-patch; charset=UTF-8)
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 859cac4..201fb42 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -38,8 +38,25 @@ PLAIN_LETTER_RANGES = ((ord('a'), ord('z')), # Latin lower case
                        (0x03b1, 0x03c9),     # GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER OMEGA
                        (0x0391, 0x03a9))     # GREEK CAPITAL LETTER ALPHA, GREEK CAPITAL LETTER OMEGA
 
+# Combining marks follow a "base" character, and result in a composite
+# character. Example: "U&'A\0300'" produces "À". There are three types of
+# combining marks: enclosing (Me), non-spacing combining (Mn), spacing
+# combining (Mc). We identify the ranges of marks we feel safe removing.
+# References:
+#   https://en.wikipedia.org/wiki/Combining_character
+#   https://www.unicode.org/charts/PDF/U0300.pdf
+#   https://www.unicode.org/charts/PDF/U20D0.pdf
+COMBINING_MARK_RANGES = ((0x0300, 0x0362),  # Mn: Accents, IPA
+                         (0x20dd, 0x20E0),  # Me: Symbols
+                         (0x20e2, 0x20e4),) # Me: Screen, keycap, triangle
+
 def print_record(codepoint, letter):
-    print (unichr(codepoint) + "\t" + letter).encode("UTF-8")
+    if letter:
+        output = unichr(codepoint) + "\t" + letter
+    else:
+        output = unichr(codepoint)
+
+    print output.encode("UTF-8")
 
 class Codepoint:
     def __init__(self, id, general_category, combining_ids):
@@ -47,6 +64,16 @@ class Codepoint:
         self.general_category = general_category
         self.combining_ids = combining_ids
 
+def is_mark_to_remove(codepoint):
+    """Return true if this is a combining mark to remove."""
+    if not is_mark(codepoint):
+        return False
+
+    for begin, end in COMBINING_MARK_RANGES:
+        if codepoint.id >= begin and codepoint.id <= end:
+            return True
+    return False
+
 def is_plain_letter(codepoint):
     """Return true if codepoint represents a "plain letter"."""
     for begin, end in PLAIN_LETTER_RANGES:
@@ -201,6 +228,8 @@ def main(args):
                              "".join(unichr(combining_codepoint.id)
                                      for combining_codepoint \
                                      in get_plain_letters(codepoint, table))))
+        elif is_mark_to_remove(codepoint):
+            charactersSet.add((codepoint.id, None))
 
     # add CLDR Latin-ASCII characters
     if not args.noLigaturesExpansion:

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hugh Ranalli (#6)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli <hugh@whtc.ca> writes:

I've attached a patch removes combining diacriticals. As with Latin and
Greek letters, it uses ranges to restrict its activity.

Cool. Please add it to the current CF so we don't forget about it:
https://commitfest.postgresql.org/21/

I have not submitted a patch for unaccent.rules, as it seems that a rules
file generated from generate_unaccent_rules.py will actually remove a large
number of rules (even before my changes), such as replacing the copyright
symbol © with (C), as well as other accented characters. It's probably
worth asking if the shipped unaccent.rules should correspond to what the
shipped generation utility produces, or not. I was surprised to see that it
didn't.

Me too -- seems like that bears looking into. Perhaps the script's
results are platform dependent -- what were you testing on?

regards, tom lane

#8 Hugh Ranalli
hugh@whtc.ca
In reply to: Tom Lane (#7)
1 attachment(s)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hugh Ranalli <hugh@whtc.ca> writes:
Cool. Please add it to the current CF so we don't forget about it:
https://commitfest.postgresql.org/21/

Done.

Me too -- seems like that bears looking into. Perhaps the script's
results are platform dependent -- what were you testing on?

I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
that's it. The program's decisions come from the two data files, the
Unicode data set and the Latin-ASCII transliteration file. The script uses
categories (
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
to identify letters (and now combining marks) and if they are in range,
performs a substitution. It then uses the transliteration file to find
rules for particular character substitutions (for example, that file seems
to handle the copyright symbol substitution). I don't see anything platform
dependent in there.
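
For readers unfamiliar with the data file, the following Python 3 sketch shows how one UnicodeData.txt record carries the general category and the decomposition that such a script relies on. The sample line is the record for U+00C0; parse_line is a hypothetical helper, not a function from generate_unaccent_rules.py.

```python
# Semicolon-separated record for U+00C0 as it appears in UnicodeData.txt.
SAMPLE = ("00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;"
          "LATIN CAPITAL LETTER A GRAVE;;;00E0;")

def parse_line(line):
    """Return (code point, general category, decomposition code points)."""
    fields = line.split(";")
    code = int(fields[0], 16)
    category = fields[2]  # e.g. Lu, Ll, Mn, Me, Mc
    # Field 5 holds the decomposition; skip tags such as <compat>.
    decomposition = [int(part, 16) for part in fields[5].split()
                     if not part.startswith("<")]
    return code, category, decomposition

print(parse_line(SAMPLE))
```

Here the category is Lu (uppercase letter) and the decomposition is "A" followed by U+0300, which is how an accented letter can be traced back to its plain base letter.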

In looking more closely, I also see that the script isn't generating ligatures,
even though it should, because although the program can generate them, none
of the ligatures are in the ranges defined in PLAIN_LETTER_RANGES, and so
they are skipped.

This could probably be handled by adding the ligature ranges to the defined
ranges. Symbol types could be added to the types it looks at, and perhaps
the codepoint ranges collapsed into one list, as the IDs are unique across
all categories. I don't think we'd want to just rely on ranges, as that
could include control characters, punctuation, etc.

There are a number of other characters that appear in unaccent.rules that
aren't generated by the script. I've attached a diff of the output of
generate_unaccent_rules (using the version before my changes, to simplify
matters) and unaccent.rules. Unfortunately, I don't know how to interpret
most of these characters.

I suppose it's valid to ask if changing © to (C) is even something an
"unaccent" function should do. Given that it's in the existing rules file,
should it be supported as an existing behaviour?

Sorry for more questions than answers. ;-)

Attachments:

unaccent.diff (text/x-patch; charset=UTF-8)
1,8d0
< ©	(C)
< «	<<
< ­	-
< ®	(R)
< »	>>
< ¼	 1/4
< ½	 1/2
< ¾	 3/4
15d6
< Æ	AE
25d15
< Ð	D
32,33d21
< ×	*
< Ø	O
39,40d26
< Þ	TH
< ß	ss
47d32
< æ	ae
57d41
< ð	d
64,65d47
< ÷	/
< ø	o
71d52
< þ	th
89,90d69
< Đ	D
< đ	d
111,112d89
< Ħ	H
< ħ	h
122d98
< ı	i
129d104
< ĸ	q
136,139d110
< Ŀ	L
< ŀ	l
< Ł	L
< ł	l
146,148d116
< ŉ	'n
< Ŋ	N
< ŋ	n
155,156d122
< Œ	OE
< œ	oe
175,176d140
< Ŧ	T
< ŧ	t
200,222d163
< ſ	s
< ƀ	b
< Ɓ	B
< Ƃ	B
< ƃ	b
< Ƈ	C
< ƈ	c
< Ɖ	D
< Ɗ	D
< Ƌ	D
< ƌ	d
< Ɛ	E
< Ƒ	F
< ƒ	f
< Ɠ	G
< ƕ	hv
< Ɩ	I
< Ɨ	I
< Ƙ	K
< ƙ	k
< ƚ	l
< Ɲ	N
< ƞ	n
225,232d165
< Ƣ	OI
< ƣ	oi
< Ƥ	P
< ƥ	p
< ƫ	t
< Ƭ	T
< ƭ	t
< Ʈ	T
235,239d167
< Ʋ	V
< Ƴ	Y
< ƴ	y
< Ƶ	Z
< ƶ	z
269,270d196
< Ǥ	G
< ǥ	g
319,321d244
< ȡ	d
< Ȥ	Z
< ȥ	z
336,401d258
< ȴ	l
< ȵ	n
< ȶ	t
< ȷ	j
< ȸ	db
< ȹ	qp
< Ⱥ	A
< Ȼ	C
< ȼ	c
< Ƚ	L
< Ⱦ	T
< ȿ	s
< ɀ	z
< Ƀ	B
< Ʉ	U
< Ɇ	E
< ɇ	e
< Ɉ	J
< ɉ	j
< Ɍ	R
< ɍ	r
< Ɏ	Y
< ɏ	y
< ɓ	b
< ɕ	c
< ɖ	d
< ɗ	d
< ɛ	e
< ɟ	j
< ɠ	g
< ɡ	g
< ɢ	G
< ɦ	h
< ɧ	h
< ɨ	i
< ɪ	I
< ɫ	l
< ɬ	l
< ɭ	l
< ɱ	m
< ɲ	n
< ɳ	n
< ɴ	N
< ɶ	OE
< ɼ	r
< ɽ	r
< ɾ	r
< ʀ	R
< ʂ	s
< ʈ	t
< ʉ	u
< ʋ	v
< ʏ	Y
< ʐ	z
< ʑ	z
< ʙ	B
< ʛ	G
< ʜ	H
< ʝ	j
< ʟ	L
< ʠ	q
< ʣ	dz
< ʥ	dz
< ʦ	ts
< ʪ	ls
< ʫ	lz
424,477d280
< ᴀ	A
< ᴁ	AE
< ᴃ	B
< ᴄ	C
< ᴅ	D
< ᴆ	D
< ᴇ	E
< ᴊ	J
< ᴋ	K
< ᴌ	L
< ᴍ	M
< ᴏ	O
< ᴘ	P
< ᴛ	T
< ᴜ	U
< ᴠ	V
< ᴡ	W
< ᴢ	Z
< ᵫ	ue
< ᵬ	b
< ᵭ	d
< ᵮ	f
< ᵯ	m
< ᵰ	n
< ᵱ	p
< ᵲ	r
< ᵳ	r
< ᵴ	s
< ᵵ	t
< ᵶ	z
< ᵺ	th
< ᵻ	I
< ᵽ	p
< ᵾ	U
< ᶀ	b
< ᶁ	d
< ᶂ	f
< ᶃ	g
< ᶄ	k
< ᶅ	l
< ᶆ	m
< ᶇ	n
< ᶈ	p
< ᶉ	r
< ᶊ	s
< ᶌ	v
< ᶍ	x
< ᶎ	z
< ᶏ	a
< ᶑ	d
< ᶒ	e
< ᶓ	e
< ᶖ	i
< ᶙ	u
632,635d434
< ẚ	a
< ẜ	s
< ẝ	s
< ẞ	SS
726,731d524
< Ỻ	LL
< ỻ	ll
< Ỽ	V
< ỽ	v
< Ỿ	Y
< ỿ	y
933,972d725
< ‐	-
< ‑	-
< ‒	-
< –	-
< —	-
< ―	-
< ‖	||
< ‘	'
< ’	'
< ‚	,
< ‛	'
< “	"
< ”	"
< „	,,
< ‟	"
< ․	.
< ‥	..
< …	...
< ′	'
< ″	"
< ‹	<
< ›	>
< ‼	!!
< ⁄	/
< ⁅	[
< ⁆	]
< ⁇	??
< ⁈	?!
< ⁉	!?
< ⁎	*
< ₠	CE
< ₢	Cr
< ₣	Fr.
< ₤	L.
< ₧	Pts
< ₹	Rs
< ₺	TL
< ℀	a/c
< ℁	a/s
< ℂ	C
974,975d726
< ℅	c/o
< ℆	c/u
977,987d727
< ℊ	g
< ℋ	H
< ℌ	x
< ℍ	H
< ℎ	h
< ℐ	I
< ℑ	I
< ℒ	L
< ℓ	l
< ℕ	N
< №	No
989,1230d728
< ℙ	P
< ℚ	Q
< ℛ	R
< ℜ	R
< ℝ	R
< ℞	Rx
< ℡	TEL
< ℤ	Z
< ℨ	Z
< ℬ	B
< ℭ	C
< ℯ	e
< ℰ	E
< ℱ	F
< ℳ	M
< ℴ	o
< ℹ	i
< ℻	FAX
< ⅅ	D
< ⅆ	d
< ⅇ	e
< ⅈ	i
< ⅉ	j
< ⅓	 1/3
< ⅔	 2/3
< ⅕	 1/5
< ⅖	 2/5
< ⅗	 3/5
< ⅘	 4/5
< ⅙	 1/6
< ⅚	 5/6
< ⅛	 1/8
< ⅜	 3/8
< ⅝	 5/8
< ⅞	 7/8
< ⅟	 1/
< Ⅰ	I
< Ⅱ	II
< Ⅲ	III
< Ⅳ	IV
< Ⅴ	V
< Ⅵ	VI
< Ⅶ	VII
< Ⅷ	VIII
< Ⅸ	IX
< Ⅹ	X
< Ⅺ	XI
< Ⅻ	XII
< Ⅼ	L
< Ⅽ	C
< Ⅾ	D
< Ⅿ	M
< ⅰ	i
< ⅱ	ii
< ⅲ	iii
< ⅳ	iv
< ⅴ	v
< ⅵ	vi
< ⅶ	vii
< ⅷ	viii
< ⅸ	ix
< ⅹ	x
< ⅺ	xi
< ⅻ	xii
< ⅼ	l
< ⅽ	c
< ⅾ	d
< ⅿ	m
< −	-
< ∕	/
< ∖	\
< ∣	|
< ∥	||
< ≪	<<
< ≫	>>
< ⑴	(1)
< ⑵	(2)
< ⑶	(3)
< ⑷	(4)
< ⑸	(5)
< ⑹	(6)
< ⑺	(7)
< ⑻	(8)
< ⑼	(9)
< ⑽	(10)
< ⑾	(11)
< ⑿	(12)
< ⒀	(13)
< ⒁	(14)
< ⒂	(15)
< ⒃	(16)
< ⒄	(17)
< ⒅	(18)
< ⒆	(19)
< ⒇	(20)
< ⒈	1.
< ⒉	2.
< ⒊	3.
< ⒋	4.
< ⒌	5.
< ⒍	6.
< ⒎	7.
< ⒏	8.
< ⒐	9.
< ⒑	10.
< ⒒	11.
< ⒓	12.
< ⒔	13.
< ⒕	14.
< ⒖	15.
< ⒗	16.
< ⒘	17.
< ⒙	18.
< ⒚	19.
< ⒛	20.
< ⒜	(a)
< ⒝	(b)
< ⒞	(c)
< ⒟	(d)
< ⒠	(e)
< ⒡	(f)
< ⒢	(g)
< ⒣	(h)
< ⒤	(i)
< ⒥	(j)
< ⒦	(k)
< ⒧	(l)
< ⒨	(m)
< ⒩	(n)
< ⒪	(o)
< ⒫	(p)
< ⒬	(q)
< ⒭	(r)
< ⒮	(s)
< ⒯	(t)
< ⒰	(u)
< ⒱	(v)
< ⒲	(w)
< ⒳	(x)
< ⒴	(y)
< ⒵	(z)
< ⦅	((
< ⦆	))
< ⩴	::=
< ⩵	==
< ⩶	===
< 、	,
< 。	.
< 〇	0
< 〈	<
< 〉	>
< 《	<<
< 》	>>
< 〔	[
< 〕	]
< 〘	[
< 〙	]
< 〚	[
< 〛	]
< 〝	"
< 〞	"
< ㍱	hPa
< ㍲	da
< ㍳	AU
< ㍴	bar
< ㍵	oV
< ㍶	pc
< ㍷	dm
< ㍺	IU
< ㎀	pA
< ㎁	nA
< ㎃	mA
< ㎄	kA
< ㎅	KB
< ㎆	MB
< ㎇	GB
< ㎈	cal
< ㎉	kcal
< ㎊	pF
< ㎋	nF
< ㎎	mg
< ㎏	kg
< ㎐	Hz
< ㎑	kHz
< ㎒	MHz
< ㎓	GHz
< ㎔	THz
< ㎙	fm
< ㎚	nm
< ㎜	mm
< ㎝	cm
< ㎞	km
< ㎧	m/s
< ㎩	Pa
< ㎪	kPa
< ㎫	MPa
< ㎬	GPa
< ㎭	rad
< ㎮	rad/s
< ㎰	ps
< ㎱	ns
< ㎳	ms
< ㎴	pV
< ㎵	nV
< ㎷	mV
< ㎸	kV
< ㎹	MV
< ㎺	pW
< ㎻	nW
< ㎽	mW
< ㎾	kW
< ㎿	MW
< ㏂	a.m.
< ㏃	Bq
< ㏄	cc
< ㏅	cd
< ㏆	C/kg
< ㏇	Co.
< ㏈	dB
< ㏉	Gy
< ㏊	ha
< ㏋	HP
< ㏌	in
< ㏍	KK
< ㏎	KM
< ㏏	kt
< ㏐	lm
< ㏑	ln
< ㏒	log
< ㏓	lx
< ㏔	mb
< ㏕	mil
< ㏖	mol
< ㏗	pH
< ㏘	p.m.
< ㏙	PPM
< ㏚	PR
< ㏛	sr
< ㏜	Sv
< ㏝	Wb
< ㏞	V/m
< ㏟	A/m
1236d733
< ﬅ	st
1238,1384d734
< ︐	,
< ︑	,
< ︒	.
< ︓	:
< ︔	;
< ︕	!
< ︖	?
< ︙	...
< ︰	..
< ︱	-
< ︲	-
< ︵	(
< ︶	)
< ︷	{
< ︸	}
< ︹	[
< ︺	]
< ︽	<<
< ︾	>>
< ︿	<
< ﹀	>
< ﹇	[
< ﹈	]
< ﹐	,
< ﹑	,
< ﹒	.
< ﹔	;
< ﹕	:
< ﹖	?
< ﹗	!
< ﹘	-
< ﹙	(
< ﹚	)
< ﹛	{
< ﹜	}
< ﹝	[
< ﹞	]
< ﹟	#
< ﹠	&
< ﹡	*
< ﹢	+
< ﹣	-
< ﹤	<
< ﹥	>
< ﹦	=
< ﹨	\
< ﹩	$
< ﹪	%
< ﹫	@
< ！	!
< ＂	"
< ＃	#
< ＄	$
< ％	%
< ＆	&
< ＇	'
< （	(
< ）	)
< ＊	*
< ＋	+
< ，	,
< －	-
< ．	.
< ／	/
< ０	0
< １	1
< ２	2
< ３	3
< ４	4
< ５	5
< ６	6
< ７	7
< ８	8
< ９	9
< ：	:
< ；	;
< ＜	<
< ＝	=
< ＞	>
< ？	?
< ＠	@
< Ａ	A
< Ｂ	B
< Ｃ	C
< Ｄ	D
< Ｅ	E
< Ｆ	F
< Ｇ	G
< Ｈ	H
< Ｉ	I
< Ｊ	J
< Ｋ	K
< Ｌ	L
< Ｍ	M
< Ｎ	N
< Ｏ	O
< Ｐ	P
< Ｑ	Q
< Ｒ	R
< Ｓ	S
< Ｔ	T
< Ｕ	U
< Ｖ	V
< Ｗ	W
< Ｘ	X
< Ｙ	Y
< Ｚ	Z
< ［	[
< ＼	\
< ］	]
< ＾	^
< ＿	_
< ｀	`
< ａ	a
< ｂ	b
< ｃ	c
< ｄ	d
< ｅ	e
< ｆ	f
< ｇ	g
< ｈ	h
< ｉ	i
< ｊ	j
< ｋ	k
< ｌ	l
< ｍ	m
< ｎ	n
< ｏ	o
< ｐ	p
< ｑ	q
< ｒ	r
< ｓ	s
< ｔ	t
< ｕ	u
< ｖ	v
< ｗ	w
< ｘ	x
< ｙ	y
< ｚ	z
< ｛	{
< ｜	|
< ｝	}
< ～	~
< ｟	((
< ｠	))
< ｡	.
< ､	,

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hugh Ranalli (#8)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli <hugh@whtc.ca> writes:

On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Me too -- seems like that bears looking into. Perhaps the script's
results are platform dependent -- what were you testing on?

I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
that's it. The program's decisions come from the two data files, the
Unicode data set and the Latin-ASCII transliteration file. The script uses
categories (
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
to identify letters (and now combining marks) and if they are in range,
performs a substitution. It then uses the transliteration file to find
rules for particular character substitutions (for example, that file seems
to handle the copyright symbol substitution). I don't see anything platform
dependent in there.

Hm. Something funny is going on here. When I fetch the two reference
files from the URLs cited in the script, and do

python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.

regards, tom lane

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#9)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

I wrote:

... I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.
I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.

A few minutes later on a Fedora 28 box: python 2.7.15 also gives me the
expected results, while python 3.6.6 fails with "SyntaxError: invalid
syntax".

So updating that script to also work with python3 might be a worthwhile
TODO item. But I'm at a loss to explain why you get different results.
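
The thread does not say which statement triggers the error. As one illustration of what a version-neutral record writer might look like, here is a hedged sketch that runs under both Python 2.7 and Python 3. format_record is a hypothetical name; print_record mirrors the name used in the script but is rewritten here, not copied from it.

```python
import sys

if sys.version_info[0] >= 3:
    unichr = chr  # Python 3 has no unichr(); chr() covers all code points

def format_record(codepoint, letter):
    """Return one rules line as UTF-8 bytes: source char, tab, replacement."""
    return (unichr(codepoint) + u"\t" + letter).encode("UTF-8")

def print_record(codepoint, letter):
    """Write one record as raw bytes, independent of the locale encoding."""
    # Python 3 exposes the byte stream as sys.stdout.buffer; Python 2 does not.
    out = getattr(sys.stdout, "buffer", sys.stdout)
    out.write(format_record(codepoint, letter) + b"\n")
```

Writing bytes explicitly avoids relying on the print statement, which exists only in Python 2.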

regards, tom lane

#11 Hugh Ranalli
hugh@whtc.ca
In reply to: Tom Lane (#9)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Sat, 15 Dec 2018 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hm. Something funny is going on here. When I fetch the two reference
files from the URLs cited in the script, and do

python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
--latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.

Well, that's embarrassing. When I looked I couldn't see anything that
looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17.
We use other versions of 2.7 on our production platforms. I'll take another
look, and check the URLs I am using.

#12 Hugh Ranalli
hugh@whtc.ca
In reply to: Hugh Ranalli (#11)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Sat, 15 Dec 2018 at 14:05, Hugh Ranalli <hugh@whtc.ca> wrote:

On Sat, 15 Dec 2018 at 13:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hm. Something funny is going on here. When I fetch the two reference
files from the URLs cited in the script, and do

python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
--latin-ascii-file Latin-ASCII.xml >newrules

I get something that's bit-for-bit the same as what's in unaccent.rules.
So there's clearly a platform difference between here and there.

I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
it on anything newer.

Well, that's embarrassing. When I looked I couldn't see anything that
looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17.
We use other versions of 2.7 on our production platforms. I'll take another
look, and check the URLs I am using.

The problem is that I downloaded the latest version of the Latin-ASCII
transliteration file (r34 rather than the r28 specified in the URL). Over 3
years ago (in r29, of course) they changed the file format (
https://unicode.org/cldr/trac/ticket/5873) so that
parse_cldr_latin_ascii_transliterator loads an empty rules set. I'd be
happy to either a) support both formats, or b), support just the newest and
update the URL. Option b) is cleaner, and I can't imagine why anyone would
want to use an older rule set (then again, struggling with Unicode always
makes my head hurt; I am not an expert on it). Thoughts?

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hugh Ranalli (#12)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli <hugh@whtc.ca> writes:

The problem is that I downloaded the latest version of the Latin-ASCII
transliteration file (r34 rather than the r28 specified in the URL). Over 3
years ago (in r29, of course) they changed the file format (
https://unicode.org/cldr/trac/ticket/5873) so that
parse_cldr_latin_ascii_transliterator loads an empty rules set.

Ah-hah.

I'd be
happy to either a) support both formats, or b), support just the newest and
update the URL. Option b) is cleaner, and I can't imagine why anyone would
want to use an older rule set (then again, struggling with Unicode always
makes my head hurt; I am not an expert on it). Thoughts?

(b) seems sufficient to me, but perhaps someone else has a different
opinion.

Whichever we do, I think it should be a separate patch from the feature
addition for combining diacriticals, just to keep the commit history
clear.

regards, tom lane

#14 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#13)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hugh Ranalli <hugh@whtc.ca> writes:

The problem is that I downloaded the latest version of the Latin-ASCII
transliteration file (r34 rather than the r28 specified in the URL). Over 3
years ago (in r29, of course) they changed the file format (
https://unicode.org/cldr/trac/ticket/5873) so that
parse_cldr_latin_ascii_transliterator loads an empty rules set.

Ah-hah.

I'd be
happy to either a) support both formats, or b), support just the newest and
update the URL. Option b) is cleaner, and I can't imagine why anyone would
want to use an older rule set (then again, struggling with Unicode always
makes my head hurt; I am not an expert on it). Thoughts?

(b) seems sufficient to me, but perhaps someone else has a different
opinion.

Whichever we do, I think it should be a separate patch from the feature
addition for combining diacriticals, just to keep the commit history
clear.

+1 for updating to the latest file from time to time. After
http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
our special_cases() function will have just the two Cyrillic
characters, which should almost certainly be handled by adding
Cyrillic to the ranges we handle via the usual code path, and DEGREE
CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be
extracted from Unicode.txt (or we could just forget about them), and
then we could drop special_cases().

--
Thomas Munro
http://www.enterprisedb.com

#15 Hugh Ranalli
hugh@whtc.ca
In reply to: Thomas Munro (#14)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Sat, 15 Dec 2018 at 21:26, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

+1 for updating to the latest file from time to time. After
http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
our special_cases() function will have just the two Cyrillic
characters, which should almost certainly be handled by adding
Cyrillic to the ranges we handle via the usual code path, and DEGREE
CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be
extracted from Unicode.txt (or we could just forget about them), and
then we could drop special_cases().

Well, when I modified the code to handle the new version of the
transliteration file, I discovered that was sufficient to handle the old
version as well. That's not the way things usually go, but I'll take it. ;-)
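
A minimal Python 3 sketch of why one code path can cover both layouts: if every tRule element's text is split into lines, an old-style file (one rule per element) and a new-style file (many rules in one element) yield the same list. The XML snippets and rule strings below are simplified stand-ins, not excerpts from the real Latin-ASCII.xml, and extract_rule_lines is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# \u00c6 is AE ligature, \u00d0 is ETH, \u2192 is the rule arrow.
OLD_STYLE = """<supplementalData><transforms><transform>
<tRule>\u00c6 \u2192 AE ;</tRule>
<tRule>\u00d0 \u2192 D ;</tRule>
</transform></transforms></supplementalData>"""

NEW_STYLE = """<supplementalData><transforms><transform>
<tRule>
\u00c6 \u2192 AE ;
\u00d0 \u2192 D ;
</tRule>
</transform></transforms></supplementalData>"""

def extract_rule_lines(xml_text):
    """Collect one string per rule, whichever layout the file uses."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.findall("./transforms/transform/tRule"):
        # Split multi-line element text; single-line text gives one entry.
        lines.extend(part.strip()
                     for part in element.text.strip().split("\n"))
    return lines

print(extract_rule_lines(OLD_STYLE) == extract_rule_lines(NEW_STYLE))
```

Each resulting line can then be matched against the rule pattern individually, as before.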

I've attached two patches, one to update generate_unaccent_rules.py, and
another that updates unaccent.rules from the v34 transliteration file. I'll
be happy to add these to the CF. Does anyone need to review them and give
me approval before I do so?

Best wishes,
Hugh

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hugh Ranalli (#15)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli <hugh@whtc.ca> writes:

I've attached two patches, one to update generate_unaccent_rules.py, and
another that updates unaccent.rules from the v34 transliteration file.

I think you forgot the patches?

I'll
be happy to add these to the CF. Does anyone need to review them and give
me approval before I do so?

Nope.

regards, tom lane

#17 Hugh Ranalli
hugh@whtc.ca
In reply to: Tom Lane (#16)
2 attachment(s)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hugh Ranalli <hugh@whtc.ca> writes:

I've attached two patches, one to update generate_unaccent_rules.py, and
another that updates unaccent.rules from the v34 transliteration file.

I think you forgot the patches?

Sigh, yes I did. That's what I get for trying to get this sent out before
heading to an appointment. Patches attached and will add to CF. Let me know
if you see anything amiss.

Hugh

Attachments:

unaccent.rules-update-to-Latin-ASCII-CDLR-v34.patch (text/x-patch; charset=UTF-8)
diff --git a/contrib/unaccent/unaccent.rules b/contrib/unaccent/unaccent.rules
index 76e4e69..7ce25ee 100644
--- a/contrib/unaccent/unaccent.rules
+++ b/contrib/unaccent/unaccent.rules
@@ -399,6 +399,21 @@
 ʦ	ts
 ʪ	ls
 ʫ	lz
+ʹ	'
+ʺ	"
+ʻ	'
+ʼ	'
+ʽ	'
+˂	<
+˃	>
+˄	^
+ˆ	^
+ˈ	'
+ˋ	`
+ː	:
+˖	+
+˗	-
+˜	~
 Ά	Α
 Έ	Ε
 Ή	Η

generate_unaccent_rules-handle-all-Latin-ASCII-versions.patch (text/x-patch; charset=US-ASCII)
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 859cac4..761b237 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -21,7 +21,8 @@
 # command line argument -- will be parsed and used.
 #
 # [1] http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
-# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
+# [2] http://unicode.org/cldr/trac/export/12304/tags/release-34/common/transforms/Latin-ASCII.xml
+#     (Ideally you should use the latest release).
 
 
 import re
@@ -121,10 +122,14 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
     # construct tree from XML
     transliterationTree = ET.parse(latinAsciiFilePath)
     transliterationTreeRoot = transliterationTree.getroot()
-
-    for rule in transliterationTreeRoot.findall("./transforms/transform/tRule"):
-        matches = rulePattern.search(rule.text)
-
+    rules = []
+    for element in transliterationTreeRoot.findall(
+      "./transforms/transform/tRule"
+      ):
+        rules.extend(element.text.strip().split("\n"))
+
+    for rule in rules:
+        matches = rulePattern.search(rule)
         # The regular expression capture four groups corresponding
         # to the characters.
         #

#18 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Hugh Ranalli (#17)
1 attachment(s)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh@whtc.ca> wrote:

On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hugh Ranalli <hugh@whtc.ca> writes:

I've attached two patches, one to update generate_unaccent_rules.py, and
another that updates unaccent.rules from the v34 transliteration file.

I think you forgot the patches?

Sigh, yes I did. That's what I get for trying to get this sent out before heading to an appointment. Patches attached and will add to CF. Let me know if you see anything amiss.

+ʹ    '
+ʺ    "
+ʻ    '
+ʼ    '
+ʽ    '
+˂    <
+˃    >
+˄    ^
+ˆ    ^
+ˈ    '
+ˋ    `
+ː    :
+˖    +
+˗    -
+˜    ~

I don't think this is quite right. Those don't seem to be the
combining codepoints[1], and in any case they are being replaced with
ASCII characters, whereas I thought we wanted to replace them with
nothing at all. Here is my attempt to come up with a test case using
combining characters:

select unaccent('un café crème s''il vous plaît');

It's not stripping the accents. I've attached that in a file for
reference so you can run it with psql -f x.sql, and you can see that
it's using combining code points (code points 0301, 0300, 0302 which
come out as cc81, cc80, cc82 in UTF-8) like so:

$ xxd x.sql
00000000: 7365 6c65 6374 2075 6e61 6363 656e 7428 select unaccent(
00000010: 2775 6e20 6361 6665 cc81 2063 7265 cc80 'un cafe.. cre..
00000020: 6d65 2073 2727 696c 2076 6f75 7320 706c me s''il vous pl
00000030: 6169 cc82 7427 293b 0a0a ai..t');..

(To come up with that I used the trick of typing ":%!xxd" and then
when finished ":%!xxd -r", to turn vim into a hex editor.)

[1]: https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
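
For anyone who wants to double-check the byte values quoted above without a hex editor, a short Python 3 sketch can list the combining code points in the test string together with their UTF-8 encodings. The helper name `combining_marks` is made up for this illustration; it relies only on the standard `unicodedata` module.

```python
import unicodedata


def combining_marks(text):
    """Return (code point, UTF-8 bytes) hex pairs for each combining mark."""
    return [
        ("%04x" % ord(ch), ch.encode("utf-8").hex())
        for ch in text
        # unicodedata.combining() returns 0 for non-combining characters
        if unicodedata.combining(ch) != 0
    ]


# Same sentence as in x.sql, written with explicit combining code points.
sample = "un cafe\u0301 cre\u0300me s'il vous plai\u0302t"

for codepoint, utf8 in combining_marks(sample):
    print(codepoint, utf8)
```

Writing the marks as `\uXXXX` escapes keeps the script itself ASCII-only, so an editor cannot silently normalise them into precomposed characters.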

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

x.sql (application/octet-stream)
#19Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#18)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Tue, Dec 18, 2018 at 3:05 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh@whtc.ca> wrote:
+ʹ    '
+ʺ    "
+ʻ    '
+ʼ    '
+ʽ    '
+˂    <
+˃    >
+˄    ^
+ˆ    ^
+ˈ    '
+ˋ    `
+ː    :
+˖    +
+˗    -
+˜    ~

I don't think this is quite right. Those don't seem to be the
combining codepoints[1], and in any case they are being replaced with
ASCII characters, whereas I thought we wanted to replace them with
nothing at all. Here is my attempt to come up with a test case using
combining characters:

select unaccent('un café crème s''il vous plaît');

Oh, I see now that that was just the v34 ASCII transliteration update,
and perhaps the diacritic stripping will be posted separately.

--
Thomas Munro
http://www.enterprisedb.com

#20Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#18)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Tue, Dec 18, 2018 at 03:05:00PM +1100, Thomas Munro wrote:

I don't think this is quite right. Those don't seem to be the
combining codepoints[1], and in any case they are being replaced with
ASCII characters, whereas I thought we wanted to replace them with
nothing at all. Here is my attempt to come up with a test case using
combining characters:

select unaccent('un café crème s''il vous plaît');

It's not stripping the accents. I've attached that in a file for
reference so you can run it with psql -f x.sql, and you can see that
it's using combining code points (code points 0301, 0300, 0302 which
come out as cc81, cc80, cc82 in UTF-8) like so:

Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
the same time? That would be nice to check easily the extent of the
patches proposed on this thread.
--
Michael

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#20)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Michael Paquier <michael@paquier.xyz> writes:

Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
the same time? That would be nice to check easily the extent of the
patches proposed on this thread.

I wonder why unaccent.sql is set up to run its tests in KOI8 client
encoding rather than UTF8. It doesn't seem like it's the business
of this test script to be verifying transcoding from KOI8 to UTF8
(and if it were meant to do that, it's a pretty incomplete test...).
But having it set up like that means that we can't directly add
such tests to unaccent.sql, because there are no combining diacritics
in the KOI8 character set. We have two unattractive options:

* Change client encodings partway through unaccent.sql. I think this
would be disastrous for editability of that file; no common tools
will understand the encoding change.

* Put the new test cases into a separate file with a different client
encoding. This is workable, I suppose, but it seems pretty silly
when the tests are only a few queries apiece.

Another problem I've got with the current setup is that it seems
unlikely that many people's editors default to an assumption of
KOI8 encoding. Mine guesses that these files are UTF8, and so
the test cases look perfectly insane. They do make sense if
I transcode the files to UTF8, but I wonder why we're not shipping
them as UTF8 in the first place.

tl;dr: I think we should convert unaccent.sql and unaccent.out
to UTF8 encoding. Then, adding more test cases for this patch
will be easy.
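
The claim that KOI8 cannot carry combining diacritics is easy to check outside the database. The following is a rough Python 3 sketch; it assumes Python's `koi8_r` codec is a fair stand-in for the KOI8 client encoding used by the test script, and the function name `fits_in_koi8` is invented for the example.

```python
def fits_in_koi8(text):
    """Return True if every character of text can be encoded as KOI8-R."""
    try:
        text.encode("koi8_r")
    except UnicodeEncodeError:
        return False
    return True


# The Cyrillic word from the existing tests, written with escapes.
print(fits_in_koi8("\u0451\u043b\u043a\u0430"))

# "A" followed by U+0300 COMBINING GRAVE ACCENT.
print(fits_in_koi8("A\u0300"))
```

The first string is representable and the second is not, which is why tests for combining marks need a different client encoding.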

regards, tom lane

#22Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#21)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:

tl;dr: I think we should convert unaccent.sql and unaccent.out
to UTF8 encoding. Then, adding more test cases for this patch
will be easy.

Do you think that we could also remove the non-ASCII characters from the
tests? It would be easy enough to use E'\xNN' (utf8 hex) or such in
input, and show the output with bytea. That's harder to read, still we
discussed about not using UTF-8 in the python script to allow folks with
simple terminals to touch the code the last time this was touched
(5e8d670) and the characters used could be documented as comments in the
tests.
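
To make the proposal concrete, here is a small Python 3 sketch of how such an escaped input literal could be produced from a readable string. The helper `utf8_escape_literal` is hypothetical and not part of any patch in this thread; it only builds the text of an E'' string from the UTF-8 bytes.

```python
def utf8_escape_literal(text):
    """Build an E'...' literal that spells out text as UTF-8 hex bytes."""
    escaped = "".join("\\x%02x" % byte for byte in text.encode("utf-8"))
    return "E'" + escaped + "'"


# The Cyrillic test word, written with escapes to keep this file ASCII-only.
print(utf8_escape_literal("\u0451\u043b\u043a\u0430"))
```

A four-letter word already turns into eight hex bytes, which illustrates the readability cost of this approach.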
--
Michael

#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#22)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Michael Paquier <michael@paquier.xyz> writes:

On Tue, Dec 18, 2018 at 12:36:02AM -0500, Tom Lane wrote:

tl;dr: I think we should convert unaccent.sql and unaccent.out
to UTF8 encoding. Then, adding more test cases for this patch
will be easy.

Do you think that we could also remove the non-ASCII characters from the
tests? It would be easy enough to use E'\xNN' (utf8 hex) or such in
input, and show the output with bytea.

I'm not really for that, because it would make the test cases harder
to verify by eyeball. With the current setup --- other than the
uncommon-outside-Russia encoding choice --- you don't really need
to read or speak Russian to see that this:

SELECT unaccent('ёлка');
unaccent
----------
елка
(1 row)

probably represents unaccent doing what it ought to. If everything
is in hex then it's a lot harder.

Ten years ago I might've agreed with your point, but today it's
hard to believe that anyone who takes any interest at all in
unaccent's functionality would not have a UTF8-capable terminal.

That's harder to read, still we
discussed about not using UTF-8 in the python script to allow folks with
simple terminals to touch the code the last time this was touched
(5e8d670) and the characters used could be documented as comments in the
tests.

Maybe I'm misremembering, but I thought that discussion was about the
code files. I am still mistrustful of non-ASCII in our code files.
But for data and test files, we've been accepting UTF8 ever since the
text-search-in-core stuff landed. Heck, unaccent.rules itself is UTF8.

regards, tom lane

#24Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#23)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Tue, Dec 18, 2018 at 01:23:57AM -0500, Tom Lane wrote:

Maybe I'm misremembering, but I thought that discussion was about the
code files. I am still mistrustful of non-ASCII in our code files.

Yes, that was in generate_unaccent_rules.py:
/messages/by-id/25859.1535076450@sss.pgh.pa.us

But for data and test files, we've been accepting UTF8 ever since the
text-search-in-core stuff landed. Heck, unaccent.rules itself is UTF8.

Okay, fine by me.
--
Michael

#25Hugh Ranalli
hugh@whtc.ca
In reply to: Thomas Munro (#18)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Mon, 17 Dec 2018 at 23:05, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

+ʹ    '
+ʺ    "
+ʻ    '
+ʼ    '
+ʽ    '
+˂    <
+˃    >
+˄    ^
+ˆ    ^
+ˈ    '
+ˋ    `
+ː    :
+˖    +
+˗    -
+˜    ~

These aren't the combining codepoints. They're new substitutions defined in
r34 of the Latin-ASCII transliteration file. I had wondered about those,
too, and did some testing.

I don't think this is quite right.

However, you are correct that something isn't right. In testing why I was
getting a different output, I had reverted to the
generate_unaccent_rules.py BEFORE my changes. And then I applied my update
for the transliteration file format to the reverted version. The patch for
generate_unaccent_rules should still be good, but the generated rules file
didn't include the combining diacriticals. In generating that, I want to
double check some of the additions before re-submitting.

On Mon, 17 Dec 2018 at 23:57, Michael Paquier <michael@paquier.xyz> wrote:

Could you also add some tests in contrib/unaccent/sql/unaccent.sql at
the same time? That would be nice to check easily the extent of the
patches proposed on this thread.

That makes sense. I'm happy to do that. Let me look at that file and see
how extensive the other changes (encoding and removal of special characters)
would be.

Hugh

#26Hugh Ranalli
hugh@whtc.ca
In reply to: Hugh Ranalli (#25)
3 attachment(s)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Okay, I've tried to separate everything cleanly. The patches are numbered
in the order in which they should be applied. Each patch contains all the
updates appropriate to that version (i.e., if the change would modify
unaccent.rules, those changes are also in the patch):

01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.
The approach I have taken is "native" Python 3 compatibility with
adjustments for Python 2. There's a marked block at the beginning of the
file that can be removed whenever Python 2 support is dropped. I haven't
followed the recommended practice of importing the "past" or "future"
modules, as the changes are minimal, and these are just additional
dependencies that need to be installed separately, which didn't seem to
make sense for a utility script. This patch also updates sql/unaccent.sql
to UTF-8 format.

02 - Updates generate_unaccent_rules.py to work with all versions (I tested
r28 and r34) of the Latin-ASCII transliteration file. It also updates
unaccent.rules to have the output of the r34 transliteration file. This
patch should work without the 01 patch.

03 - Updates generate_unaccent_rules.py to remove combining diacritical
marks. It also updates unaccent.rules with the revised output, and adds
tests to sql/unaccent.sql. It will not work or apply if the 01 patch is not
applied. It should work without the 02 patch.
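
As an illustration of the 02 change, the sketch below parses a transliteration file in which a single tRule element holds several rules, one per line. It is a simplified Python 3 version of the approach in the patch, not the script itself: the sample XML is made up for the example (it only imitates the layout of Latin-ASCII.xml), and `parse_rules` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

# Same shape as the rule pattern in generate_unaccent_rules.py:
# a source character (literal or \uXXXX), an arrow, and a target.
RULE_PATTERN = re.compile(
    r"^(?:(.)|(\\u[0-9a-fA-F]{4})) \u2192 (?:'(.+)'|(.+)) ;")

# Made-up sample: one tRule element containing two rules.
SAMPLE = """<supplementalData>
  <transforms>
    <transform>
      <tRule>
\u00c6 \u2192 AE ;
\u00df \u2192 ss ;
      </tRule>
    </transform>
  </transforms>
</supplementalData>"""


def parse_rules(xml_text):
    """Return (source, target) pairs from every line of every tRule."""
    root = ET.fromstring(xml_text)
    rules = []
    for element in root.findall("./transforms/transform/tRule"):
        # Newer files pack many rules into one element, so split on lines.
        rules.extend(element.text.strip().split("\n"))

    pairs = []
    for rule in rules:
        matches = RULE_PATTERN.search(rule)
        if matches is None:
            continue
        if matches.group(1) is not None:
            src = matches.group(1)
        else:
            # Turn a literal \uXXXX escape into the character it names.
            src = chr(int(matches.group(2)[2:], 16))
        trg = matches.group(3) if matches.group(3) is not None else matches.group(4)
        pairs.append((src, trg))
    return pairs


for src, trg in parse_rules(SAMPLE):
    print(src + "\t" + trg)
```

Splitting on newlines also works for the older layout with one rule per element, since a single-line text yields a one-item list.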

When you look at unaccent.rules generated by the 03 version, there may
appear to be blank lines. I've checked and they're not blank. They are
characters which are only visible with other characters in front of them,
at least in my editor.
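
One way to confirm that those lines are not empty is to rebuild them and print their code points. This is only a sketch in Python 3: it copies the ranges from the 03 patch, uses the standard `unicodedata` module in place of the script's own UnicodeData.txt parsing, and the names `removal_rules` and `describe` are invented for the example.

```python
import unicodedata

# Ranges copied from the 03 patch.
COMBINING_MARK_RANGES = ((0x0300, 0x0362),  # Mn: Accents, IPA
                         (0x20dd, 0x20e0),  # Me: Symbols
                         (0x20e2, 0x20e4))  # Me: Screen, keycap, triangle


def removal_rules():
    """Build one single-character rule line per combining mark in the ranges."""
    lines = []
    for begin, end in COMBINING_MARK_RANGES:
        for codepoint in range(begin, end + 1):
            character = chr(codepoint)
            # Keep only marks: non-spacing, enclosing, or spacing combining.
            if unicodedata.category(character) in ("Mn", "Me", "Mc"):
                lines.append(character)
    return lines


def describe(line):
    """Show a rule line as code points, since the mark itself may be invisible."""
    return " ".join("U+%04X" % ord(character) for character in line)


for line in removal_rules():
    print(describe(line), unicodedata.name(line, "<unnamed>"))
```

Each generated line holds exactly one code point and no replacement text, which is why it renders as blank unless a base character precedes it.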

I'll go update the CommitFest now. I hope I've covered everything; please
let me know if there's anything I've missed.

Best wishes,
Hugh

Attachments:

01-generate-unaccent-rules-python2-and-3-01.patch (text/x-patch)
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 859cac4..53e9fbb 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -23,6 +23,24 @@
 # [1] http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
 # [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
 
+# BEGIN: Python 2/3 compatibility - Remove when Python 2 compatibility dropped
+# Approach is to be Python3 compatible with Python2 "backports"
+from __future__ import unicode_literals
+from __future__ import print_function
+import codecs
+import sys
+
+if sys.version_info[0] <= 2:
+    # Encode stdout as UTF-8, so we can just print to it
+    sys.stdout = codecs.getwriter('utf8')(sys.stdout)
+
+    # Map Python 2's chr to unichr
+    chr = unichr
+
+    # Python 2 and 3 compatible bytes call
+    def bytes(source, encoding='ascii', errors='strict'):
+        return source.encode(encoding=encoding, errors=errors)
+# END: Python 2/3 compatibility - Remove when Python 2 compatibility dropped
 
 import re
 import argparse
@@ -39,7 +57,7 @@ PLAIN_LETTER_RANGES = ((ord('a'), ord('z')), # Latin lower case
                        (0x0391, 0x03a9))     # GREEK CAPITAL LETTER ALPHA, GREEK CAPITAL LETTER OMEGA
 
 def print_record(codepoint, letter):
-    print (unichr(codepoint) + "\t" + letter).encode("UTF-8")
+    print (chr(codepoint) + "\t" + letter)
 
 class Codepoint:
     def __init__(self, id, general_category, combining_ids):
@@ -116,7 +134,7 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
     charactersSet = set()
 
     # RegEx to parse rules
-    rulePattern = re.compile(ur'^(?:(.)|(\\u[0-9a-fA-F]{4})) \u2192 (?:\'(.+)\'|(.+)) ;')
+    rulePattern = re.compile(r'^(?:(.)|(\\u[0-9a-fA-F]{4})) \u2192 (?:\'(.+)\'|(.+)) ;')
 
     # construct tree from XML
     transliterationTree = ET.parse(latinAsciiFilePath)
@@ -134,7 +152,9 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
         # Group 3: plain "trg" char. Empty if group 4 is not.
         # Group 4: plain "trg" char between quotes. Empty if group 3 is not.
         if matches is not None:
-            src = matches.group(1) if matches.group(1) is not None else matches.group(2).decode('unicode-escape')
+            src = matches.group(1) if matches.group(1) is not None else bytes(
+                matches.group(2), 'UTF-8'
+                ).decode('unicode-escape')
             trg = matches.group(3) if matches.group(3) is not None else matches.group(4)
 
             # "'" and """ are escaped
@@ -195,10 +215,10 @@ def main(args):
            len(codepoint.combining_ids) > 1:
             if is_letter_with_marks(codepoint, table):
                 charactersSet.add((codepoint.id,
-                             unichr(get_plain_letter(codepoint, table).id)))
+                             chr(get_plain_letter(codepoint, table).id)))
             elif args.noLigaturesExpansion is False and is_ligature(codepoint, table):
                 charactersSet.add((codepoint.id,
-                             "".join(unichr(combining_codepoint.id)
+                             "".join(chr(combining_codepoint.id)
                                      for combining_codepoint \
                                      in get_plain_letters(codepoint, table))))
 
diff --git a/contrib/unaccent/sql/unaccent.sql b/contrib/unaccent/sql/unaccent.sql
index 3102139..77c02c7 100644
--- a/contrib/unaccent/sql/unaccent.sql
+++ b/contrib/unaccent/sql/unaccent.sql
@@ -3,16 +3,16 @@ CREATE EXTENSION unaccent;
 -- must have a UTF8 database
 SELECT getdatabaseencoding();
 
-SET client_encoding TO 'KOI8';
+SET client_encoding TO 'UTF-8';
 
 SELECT unaccent('foobar');
-SELECT unaccent('ёлка');
-SELECT unaccent('ЁЖИК');
+SELECT unaccent('ёлка');
+SELECT unaccent('ЁЖИК');
 
 SELECT unaccent('unaccent', 'foobar');
-SELECT unaccent('unaccent', 'ёлка');
-SELECT unaccent('unaccent', 'ЁЖИК');
+SELECT unaccent('unaccent', 'ёлка');
+SELECT unaccent('unaccent', 'ЁЖИК');
 
 SELECT ts_lexize('unaccent', 'foobar');
-SELECT ts_lexize('unaccent', 'ёлка');
-SELECT ts_lexize('unaccent', 'ЁЖИК');
+SELECT ts_lexize('unaccent', 'ёлка');
+SELECT ts_lexize('unaccent', 'ЁЖИК');
02-generate_unaccent_rules-handle-all-Latin-ASCII-versions-01.patch (text/x-patch)
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 53e9fbb..a0cc8c9 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -21,7 +21,8 @@
 # command line argument -- will be parsed and used.
 #
 # [1] http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
-# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
+# [2] http://unicode.org/cldr/trac/export/12304/tags/release-34/common/transforms/Latin-ASCII.xml
+#     (Ideally you should use the latest release).
 
 # BEGIN: Python 2/3 compatibility - Remove when Python 2 compatibility dropped
 # Approach is to be Python3 compatible with Python2 "backports"
@@ -140,9 +141,14 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
     transliterationTree = ET.parse(latinAsciiFilePath)
     transliterationTreeRoot = transliterationTree.getroot()
 
-    for rule in transliterationTreeRoot.findall("./transforms/transform/tRule"):
-        matches = rulePattern.search(rule.text)
+    rules = []
+    for element in transliterationTreeRoot.findall(
+      "./transforms/transform/tRule"
+      ):
+        rules.extend(element.text.strip().split("\n"))
 
+    for rule in rules:
+        matches = rulePattern.search(rule)
         # The regular expression capture four groups corresponding
         # to the characters.
         #
diff --git a/contrib/unaccent/unaccent.rules b/contrib/unaccent/unaccent.rules
index 76e4e69..7ce25ee 100644
--- a/contrib/unaccent/unaccent.rules
+++ b/contrib/unaccent/unaccent.rules
@@ -399,6 +399,21 @@
 ʦ	ts
 ʪ	ls
 ʫ	lz
+ʹ	'
+ʺ	"
+ʻ	'
+ʼ	'
+ʽ	'
+˂	<
+˃	>
+˄	^
+ˆ	^
+ˈ	'
+ˋ	`
+ː	:
+˖	+
+˗	-
+˜	~
 Ά	Α
 Έ	Ε
 Ή	Η
03-generate_unaccent_rules-remove-combining-diacritical-accents-01.patch (text/x-patch)
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index a0cc8c9..de0dabc 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -57,8 +57,25 @@ PLAIN_LETTER_RANGES = ((ord('a'), ord('z')), # Latin lower case
                        (0x03b1, 0x03c9),     # GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER OMEGA
                        (0x0391, 0x03a9))     # GREEK CAPITAL LETTER ALPHA, GREEK CAPITAL LETTER OMEGA
 
+# Combining marks follow a "base" character, and result in a composite
+# character. Example: "U&'A\0300'" produces "À". There are three types of
+# combining marks: enclosing (Me), non-spacing combining (Mn), spacing
+# combining (Mc). We identify the ranges of marks we feel safe removing.
+# References:
+#   https://en.wikipedia.org/wiki/Combining_character
+#   https://www.unicode.org/charts/PDF/U0300.pdf
+#   https://www.unicode.org/charts/PDF/U20D0.pdf
+COMBINING_MARK_RANGES = ((0x0300, 0x0362),  # Mn: Accents, IPA
+                         (0x20dd, 0x20E0),  # Me: Symbols
+                         (0x20e2, 0x20e4),) # Me: Screen, keycap, triangle
+
 def print_record(codepoint, letter):
-    print (chr(codepoint) + "\t" + letter)
+    if letter:
+        output = chr(codepoint) + "\t" + letter
+    else:
+        output = chr(codepoint)
+
+    print(output)
 
 class Codepoint:
     def __init__(self, id, general_category, combining_ids):
@@ -66,6 +83,16 @@ class Codepoint:
         self.general_category = general_category
         self.combining_ids = combining_ids
 
+def is_mark_to_remove(codepoint):
+    """Return true if this is a combining mark to remove."""
+    if not is_mark(codepoint):
+        return False
+
+    for begin, end in COMBINING_MARK_RANGES:
+        if codepoint.id >= begin and codepoint.id <= end:
+            return True
+    return False
+
 def is_plain_letter(codepoint):
     """Return true if codepoint represents a "plain letter"."""
     for begin, end in PLAIN_LETTER_RANGES:
@@ -227,6 +254,8 @@ def main(args):
                              "".join(chr(combining_codepoint.id)
                                      for combining_codepoint \
                                      in get_plain_letters(codepoint, table))))
+        elif is_mark_to_remove(codepoint):
+            charactersSet.add((codepoint.id, None))
 
     # add CLDR Latin-ASCII characters
     if not args.noLigaturesExpansion:
diff --git a/contrib/unaccent/sql/unaccent.sql b/contrib/unaccent/sql/unaccent.sql
index 77c02c7..4ff21f8 100644
--- a/contrib/unaccent/sql/unaccent.sql
+++ b/contrib/unaccent/sql/unaccent.sql
@@ -8,11 +8,14 @@ SET client_encoding TO 'UTF-8';
 SELECT unaccent('foobar');
 SELECT unaccent('ёлка');
 SELECT unaccent('ЁЖИК');
+SELECT unaccent('À');  -- Remove combining diacritical 0x0300
 
 SELECT unaccent('unaccent', 'foobar');
 SELECT unaccent('unaccent', 'ёлка');
 SELECT unaccent('unaccent', 'ЁЖИК');
+SELECT unaccent('unaccent', 'À');
 
 SELECT ts_lexize('unaccent', 'foobar');
 SELECT ts_lexize('unaccent', 'ёлка');
 SELECT ts_lexize('unaccent', 'ЁЖИК');
+SELECT ts_lexize('unaccent', 'À');
diff --git a/contrib/unaccent/unaccent.rules b/contrib/unaccent/unaccent.rules
index 7ce25ee..9982640 100644
--- a/contrib/unaccent/unaccent.rules
+++ b/contrib/unaccent/unaccent.rules
@@ -414,6 +414,105 @@
 ˖	+
 ˗	-
 ˜	~
+̀
+́
+̂
+̃
+̄
+̅
+̆
+̇
+̈
+̉
+̊
+̋
+̌
+̍
+̎
+̏
+̐
+̑
+̒
+̓
+̔
+̕
+̖
+̗
+̘
+̙
+̚
+̛
+̜
+̝
+̞
+̟
+̠
+̡
+̢
+̣
+̤
+̥
+̦
+̧
+̨
+̩
+̪
+̫
+̬
+̭
+̮
+̯
+̰
+̱
+̲
+̳
+̴
+̵
+̶
+̷
+̸
+̹
+̺
+̻
+̼
+̽
+̾
+̿
+̀
+́
+͂
+̓
+̈́
+ͅ
+͆
+͇
+͈
+͉
+͊
+͋
+͌
+͍
+͎
+͏
+͐
+͑
+͒
+͓
+͔
+͕
+͖
+͗
+͘
+͙
+͚
+͛
+͜
+͝
+͞
+͟
+͠
+͡
+͢
 Ά	Α
 Έ	Ε
 Ή	Η
@@ -982,6 +1081,13 @@
 ₧	Pts
 ₹	Rs
 ₺	TL
+⃝
+⃞
+⃟
+⃠
+⃢
+⃣
+⃤
 ℀	a/c
 ℁	a/s
 ℂ	C
#27Michael Paquier
michael@paquier.xyz
In reply to: Hugh Ranalli (#26)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Thu, Dec 20, 2018 at 05:39:36PM -0500, Hugh Ranalli wrote:

I'll go update the CommitFest now. I hope I've covered everything; please
let me know if there's anything I've missed.

-# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
+# [2] http://unicode.org/cldr/trac/export/12304/tags/release-34/common/transforms/Latin-ASCII.xml
+#     (Ideally you should use the latest release).

I have begun playing with this patch set. And for the note this URL
is incorrect. Here is a more correct one:
https://unicode.org/cldr/trac/browser/tags/release-34/common/transforms/Latin-ASCII.xml

And for the information it is possible to get the latest released
versions by browsing the code (see the tags release-*):
https://unicode.org/cldr/trac/browser/tags
--
Michael

#28Hugh Ranalli
hugh@whtc.ca
In reply to: Michael Paquier (#27)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Thu, 27 Dec 2018 at 01:50, Michael Paquier <michael@paquier.xyz> wrote:

-# [2]
http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
+# [2]
http://unicode.org/cldr/trac/export/12304/tags/release-34/common/transforms/Latin-ASCII.xml
+#     (Ideally you should use the latest release).

I have begun playing with this patch set. And for the note this URL
is incorrect. Here is a more correct one:

https://unicode.org/cldr/trac/browser/tags/release-34/common/transforms/Latin-ASCII.xml

And for the information it is possible to get the latest released
versions by browsing the code (see the tags release-*):
https://unicode.org/cldr/trac/browser/tags

Thank you. As I've said, I only pretend to be someone who knows something
about Unicode. ;-) I'll update once we've determined there is no further
feedback, so I'm not releasing too many changes, if that's okay.

Hugh

#29Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Hugh Ranalli (#26)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On 20/12/2018 23:39, Hugh Ranalli wrote:

01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.

My opinion is that we should just convert the whole thing to Python 3
and be done. This script is only run rarely, on a developer's machine,
so it's not unreasonable to expect Python 3 to be available.

The only other Python script I can find in the source is
src/test/locale/sort-test.py, which we should similarly convert.

This patch also updates sql/unaccent.sql to UTF-8 format.

I have committed that in the meantime.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#30Hugh Ranalli
hugh@whtc.ca
In reply to: Peter Eisentraut (#29)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Wed, 2 Jan 2019 at 12:41, Peter Eisentraut <
peter.eisentraut@2ndquadrant.com> wrote:

On 20/12/2018 23:39, Hugh Ranalli wrote:

01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.

My opinion is that we should just convert the whole thing to Python 3
and be done. This script is only run rarely, on a developer's machine,
so it's not unreasonable to expect Python 3 to be available.

Well, this is definitely an edge case, but I am actually running the
patched script from a complex application installer running a
custom-compiled version of Python 2.7. The installer runs under the same
Python instance as the application. I certainly could invoke Python 3 to
run this script, it's just a little more work, so I'm happy to go with the
team's decision. Just let me know.

Hugh

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hugh Ranalli (#30)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hugh Ranalli <hugh@whtc.ca> writes:

On Wed, 2 Jan 2019 at 12:41, Peter Eisentraut <
peter.eisentraut@2ndquadrant.com> wrote:

My opinion is that we should just convert the whole thing to Python 3
and be done. This script is only run rarely, on a developer's machine,
so it's not unreasonable to expect Python 3 to be available.

Well, this is definitely an edge case, but I am actually running the
patched script from a complex application installer running a
custom-compiled version of Python 2.7. The installer runs under the same
Python instance as the application. I certainly could invoke Python 3 to
run this script, it's just a little more work, so I'm happy to go with the
team's decision. Just let me know.

Seeing that supporting python 2 only adds a dozen lines of code,
I vote for retaining it for now. It'd be appropriate to drop that when
python 3 is the overwhelmingly more-installed version, but AFAICT that
isn't the case yet.

regards, tom lane

#32Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#31)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Wed, Jan 02, 2019 at 02:32:32PM -0500, Tom Lane wrote:

Seeing that supporting python 2 only adds a dozen lines of code,
I vote for retaining it for now. It'd be appropriate to drop that when
python 3 is the overwhelmingly more-installed version, but AFAICT that
isn't the case yet.

As a side note, if I recall correctly Python 2.7 will be EOL'd in
2020 by community, though I suspect that a couple of vendors will
still maintain compatibility for a couple of years in what they ship.
CentOS and RHEL enter in this category perhaps. Like Peter, I would
vote for just maintaining support for Python 3 in this script, as any
modern development machines have it anyway, and not a lot of commits
involve it (I am counting 4 since 2015).
--
Michael

#33Hugh Ranalli
hugh@whtc.ca
In reply to: Michael Paquier (#32)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Wed, 2 Jan 2019 at 20:15, Michael Paquier <michael@paquier.xyz> wrote:

As a side note, if I recall correctly Python 2.7 will be EOL'd in
2020 by community, though I suspect that a couple of vendors will
still maintain compatibility for a couple of years in what they ship.
CentOS and RHEL enter in this category perhaps. Like Peter, I would
vote for just maintaining support for Python 3 in this script, as any
modern development machines have it anyway, and not a lot of commits
involve it (I am counting 4 since 2015).

I realise this is an incredibly minor component of the PostgreSQL
infrastructure, but as I don't want to hold up reviewers, may I ask:

- It seems we have two votes for Python 3 only, and one for Python 2/3.
I lean toward Python 2/3 myself because: a) many distributions still ship
with Python 2 as the default and b) it's a single code block that can
easily be removed. If the decision is for Python 3, I'd like at least to
add a check that catches this and prints a message, rather than leaving
someone with a cryptic runtime error that makes them think the script is
broken;
- Michael Paquier, do you have any other comments? If not, I'll adjust
the documentation to use the URLs you have indicated. If you are
downloading via curl or wget, the URL I used is the proper one. It gives
you the XML file, whereas the other saves the HTML interface, leading to
errors if you try to run it. I'll also add this to the documentation.
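
For the Python 3 only option, the check mentioned in the first point could look roughly like the sketch below. It is an illustration only: the function name `python3_required_message` and the wording of the message are assumptions, and nothing like it is in the posted patches.

```python
import sys


def python3_required_message(version_info=sys.version_info):
    """Return an error message if running under Python 2, else None."""
    if version_info[0] < 3:
        return ("This script requires Python 3; found Python %d.%d"
                % (version_info[0], version_info[1]))
    return None


message = python3_required_message()
if message is not None:
    # sys.exit() with a string prints it to stderr and exits with status 1.
    sys.exit(message)
```

The check only helps if it sits at the top of the file and the rest of the script still parses under Python 2; otherwise the interpreter reports a syntax error before the message can be printed.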

Once I have clarification on these, I'll update the patches.

Thanks,
Hugh

#34Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Hugh Ranalli (#33)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On 2019-Jan-03, Hugh Ranalli wrote:

I realise this is an incredibly minor component of the PostgreSQL
infrastructure, but as I don't want to hold up reviewers, may I ask:

- It seems we have two votes for Python 3 only, and one for Python 2/3.
I lean toward Python 2/3 myself because: a) many distributions still ship
with Python 2 as the default and b) it's a single code block that can
easily be removed. If the decision is for Python 3, I'd like at least to
add a check that catches this and prints a message, rather than leaving
someone with a cryptic runtime error that makes them think the script is
broken;

I kinda agree with Peter that this is a fringe, rarely run program where
the python3 requirement is unlikely to be onerous, but since the 2/3
compatibility is so little code, I would opt for keeping it for the time
being. We can remove it in a couple of years.

- Michael Paquier, do you have any other comments? If not, I'll adjust
the documentation to use the URLs you have indicated. If you are
downloading via curl or wget, the URL I used is the proper one. It gives
you the XML file, whereas the other saves the HTML interface, leading to
errors if you try to run it. I'll also add this to the documentation.

I think the point is that if the committee updates with a further
version of the file, how do you find the new version? We need a URL
that's one step removed from the final file, so that we can see if we
need to update it. Maybe we can provide both URLs for convenience.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#34)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I think the point is that if the committee updates with a further
version of the file, how do you find the new version? We need a URL
that's one step removed from the final file, so that we can see if we
need to update it. Maybe we can provide both URLs for convenience.

+1. Could be phrased along the lines of "documents are at URL1,
currently synced with URL2" so that it's clear that URL2 should
be updated when we re-sync with a newer release.

regards, tom lane

#36Hugh Ranalli
hugh@whtc.ca
In reply to: Tom Lane (#35)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Thu, 3 Jan 2019 at 13:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I think the point is that if the committee updates with a further
version of the file, how do you find the new version? We need a URL
that's one step removed from the final file, so that we can see if we
need to update it. Maybe we can provide both URLs for convenience.

+1. Could be phrased along the lines of "documents are at URL1,
currently synced with URL2" so that it's clear that URL2 should
be updated when we re-sync with a newer release.

Yes, this is what I was thinking. I was integrating this into my installer,
used the "new" URL provided to download the file, and spent several minutes
wondering why the script was failing (and what I had broken in it), before
realising what had happened.

#37Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Alvaro Herrera (#34)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On 03/01/2019 19:19, Alvaro Herrera wrote:

I kinda agree with Peter that this is a fringe, rarely run program where
the python3 requirement is unlikely to be onerous, but since the 2/3
compatibility is so little code, I would opt for keeping it for the time
being. We can remove it in a couple of years.

OK, committed with the compat layer. I also fixed up sort-test.py for
Python 3, so now everything in the source should support Python 3.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#38Michael Paquier
michael@paquier.xyz
In reply to: Hugh Ranalli (#36)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Thu, Jan 03, 2019 at 04:48:33PM -0500, Hugh Ranalli wrote:

On Thu, 3 Jan 2019 at 13:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I think the point is that if the committee updates with a further
version of the file, how do you find the new version? We need a URL
that's one step removed from the final file, so that we can see if we
need to update it. Maybe we can provide both URLs for convenience.

+1. Could be phrased along the lines of "documents are at URL1,
currently synced with URL2" so that it's clear that URL2 should
be updated when we re-sync with a newer release.

Yes, this is what I was thinking. I was integrating this into my installer,
used the "new" URL provided to download the file, and spent several minutes
wondering why the script was failing (and what I had broken in it), before
realising what had happened.

I think that we could just use the URLs I am mentioning here:
/messages/by-id/20181227064958.GK2106@paquier.xyz

I haven't been able to finish what I wanted for the proposed patch set
yet, but what I was thinking about is to include:
1) The root URL where all the release folders are present
2) The full URL of the current Latin-ASCII.xml being used for the
generation, not as a URL pointing to the latest version, but as a URL
pointing to an exact version in time (I doubt that a released version
ever changes in this tree, but who knows...).
3) The version used to generate the rules.
--
Michael

#39Hugh Ranalli
hugh@whtc.ca
In reply to: Michael Paquier (#38)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Fri, 4 Jan 2019 at 08:00, Michael Paquier <michael@paquier.xyz> wrote:

I haven't been able to finish what I wanted for the proposed patch set
yet, but what I was thinking about is to include:
1) The root URL where all the release folders are present
2) The full URL of the current Latin-ASCII.xml being used for the
generation, not as a URL pointing to the latest version, but as a URL
pointing to an exact version in time (I doubt that a released version
ever changes in this tree, but who knows...).
3) The version used to generate the rules.

Hi Michael,
I think we're on the same page. I'll wait for you to finish your review and
provide any further comments before I make any changes.

Thanks,
Hugh

#40Michael Paquier
michael@paquier.xyz
In reply to: Hugh Ranalli (#39)
1 attachment(s)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hi Hugh,

On Fri, Jan 04, 2019 at 11:29:42AM -0500, Hugh Ranalli wrote:

I think we're on the same page. I'll wait for you to finish your review and
provide any further comments before I make any changes.

I have done a bit more than a review: I studied the new and old
formats myself, looked at how the XML parsing part could handle them,
and hacked on the code. On top of the incorrect URL for
Latin-ASCII.xml, I have noticed as well that there should be only one
transforms/transform/tRule block in the source, so I think that we
should add an assertion on that as a sanity check. I have also
changed the code to use splitlines(), which is more portable across
platforms, and added an extra regression test for the new characters
added to unaccent.rules. This does not close this thread, but we can
support the new format this way. I have also documented the way to
browse the full set of releases for Latin-ASCII.xml, and precisely
which version has been used for this patch.

This does not yet close the part about diacritical characters, but
supporting the new format is a step in this direction. What do
you think?
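
As a rough illustration of the parsing approach described above (a
sketch, not the attached patch): the XML below is a hypothetical,
heavily shortened stand-in for Latin-ASCII.xml, and the variable names
are made up for the example. It assumes Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for Latin-ASCII.xml (release 29+ layout):
# a single tRule block holding one rule per line.
SAMPLE_XML = """<supplementalData>
  <transforms>
    <transform source="Latin" target="ASCII" direction="forward">
      <tRule>
\u02C2 \u2192 '&lt;' ;
\u02C3 \u2192 '&gt;' ;
      </tRule>
    </transform>
  </transforms>
</supplementalData>"""

root = ET.fromstring(SAMPLE_XML)

# Sanity check: the new format is expected to have exactly one tRule block.
blocks = root.findall("./transforms/transform/tRule")
assert len(blocks) == 1

# Split the block into one rule per line, dropping blank lines.
rules = [line.strip() for line in blocks[0].text.splitlines() if line.strip()]

for rule in rules:
    # ascii() keeps the output printable on consoles that are not UTF-8.
    print(ascii(rule))
```

The assertion fails loudly if a future release goes back to several
tRule elements, which is the point of the sanity check.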
--
Michael

Attachments:

unaccent-format-update.patchtext/x-diff; charset=utf-8Download
diff --git a/contrib/unaccent/expected/unaccent.out b/contrib/unaccent/expected/unaccent.out
index 0835e141af..44d70771ac 100644
--- a/contrib/unaccent/expected/unaccent.out
+++ b/contrib/unaccent/expected/unaccent.out
@@ -25,6 +25,12 @@ SELECT unaccent('ЁЖИК');
  ЕЖИК
 (1 row)
 
+SELECT unaccent('˃');
+ unaccent 
+----------
+ >
+(1 row)
+
 SELECT unaccent('unaccent', 'foobar');
  unaccent 
 ----------
@@ -43,6 +49,12 @@ SELECT unaccent('unaccent', 'ЁЖИК');
  ЕЖИК
 (1 row)
 
+SELECT unaccent('unaccent', '˃');
+ unaccent 
+----------
+ >
+(1 row)
+
 SELECT ts_lexize('unaccent', 'foobar');
  ts_lexize 
 -----------
@@ -61,3 +73,9 @@ SELECT ts_lexize('unaccent', 'ЁЖИК');
  {ЕЖИК}
 (1 row)
 
+SELECT ts_lexize('unaccent', '˃');
+ ts_lexize 
+-----------
+ {>}
+(1 row)
+
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index c9aef490ae..0a181f6857 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -20,8 +20,13 @@
 # option is enabled, the XML file of this transliterator [2] -- given as a
 # command line argument -- will be parsed and used.
 #
+# Ideally you should use the latest release for each data set.  For
+# Latin-ASCII.xml, the latest data set released can be browsed directly
+# via [3].  Note that this script is compatible with at least release 29.
+#
 # [1] http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
-# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
+# [2] http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
+# [3] https://unicode.org/cldr/trac/browser/tags
 
 # BEGIN: Python 2/3 compatibility - remove when Python 2 compatibility dropped
 # The approach is to be Python3 compatible with Python2 "backports".
@@ -140,8 +145,18 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
     transliterationTree = ET.parse(latinAsciiFilePath)
     transliterationTreeRoot = transliterationTree.getroot()
 
-    for rule in transliterationTreeRoot.findall("./transforms/transform/tRule"):
-        matches = rulePattern.search(rule.text)
+    # Fetch all the transformation rules.  Since release 29 of Latin-ASCII.xml
+    # all the transliteration rules are located in a single tRule block with
+    # all rules separated into separate lines.
+    blockRules = transliterationTreeRoot.findall("./transforms/transform/tRule")
+    assert(len(blockRules) == 1)
+
+    # Split the block of rules into one element per line.
+    rules = blockRules[0].text.splitlines()
+
+    # And finish the processing of each individual rule.
+    for rule in rules:
+        matches = rulePattern.search(rule)
 
         # The regular expression capture four groups corresponding
         # to the characters.
diff --git a/contrib/unaccent/sql/unaccent.sql b/contrib/unaccent/sql/unaccent.sql
index ba72ab6261..d7d3a87e87 100644
--- a/contrib/unaccent/sql/unaccent.sql
+++ b/contrib/unaccent/sql/unaccent.sql
@@ -8,11 +8,14 @@ SET client_encoding TO 'UTF8';
 SELECT unaccent('foobar');
 SELECT unaccent('ёлка');
 SELECT unaccent('ЁЖИК');
+SELECT unaccent('˃');
 
 SELECT unaccent('unaccent', 'foobar');
 SELECT unaccent('unaccent', 'ёлка');
 SELECT unaccent('unaccent', 'ЁЖИК');
+SELECT unaccent('unaccent', '˃');
 
 SELECT ts_lexize('unaccent', 'foobar');
 SELECT ts_lexize('unaccent', 'ёлка');
 SELECT ts_lexize('unaccent', 'ЁЖИК');
+SELECT ts_lexize('unaccent', '˃');
diff --git a/contrib/unaccent/unaccent.rules b/contrib/unaccent/unaccent.rules
index 76e4e69beb..7ce25eef03 100644
--- a/contrib/unaccent/unaccent.rules
+++ b/contrib/unaccent/unaccent.rules
@@ -399,6 +399,21 @@
 ʦ	ts
 ʪ	ls
 ʫ	lz
+ʹ	'
+ʺ	"
+ʻ	'
+ʼ	'
+ʽ	'
+˂	<
+˃	>
+˄	^
+ˆ	^
+ˈ	'
+ˋ	`
+ː	:
+˖	+
+˗	-
+˜	~
 Ά	Α
 Έ	Ε
 Ή	Η
#41Hugh Ranalli
hugh@whtc.ca
In reply to: Michael Paquier (#40)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Tue, 8 Jan 2019 at 22:53, Michael Paquier <michael@paquier.xyz> wrote:

I have done a bit more than a review: I studied the new and old
formats myself, looked at how the XML parsing part could handle them,
and hacked on the code. On top of the incorrect URL for
Latin-ASCII.xml, I have noticed as well that there should be only one
transforms/transform/tRule block in the source, so I think that we
should add an assertion on that as a sanity check. I have also
changed the code to use splitlines(), which is more portable across
platforms, and added an extra regression test for the new characters
added to unaccent.rules. This does not close this thread, but we can
support the new format this way. I have also documented the way to
browse the full set of releases for Latin-ASCII.xml, and precisely
which version has been used for this patch.

This does not yet close the part about diacritical characters, but
supporting the new format is a step in this direction. What do
you think?

Hi Michael,
Thank you for putting so much effort into this. I think that looks great.
When I was doing this, I discovered that I could parse both pre- and post-
r29 versions, so I went with that, but I agree that there's probably no
good reason to do so.

And thank you for the information on splitlines; that's a method I've
overlooked. .split('\n') should be identical, if python is, as usual,
compiled with universal newlines support, but it's nice to have a method
guaranteed to work in all instances.
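
To make the difference concrete, here is a small sketch (the sample
string is made up for illustration). The two calls agree when the text
contains only "\n" line endings, but they differ as soon as a "\r"
survives in the string:

```python
# A string mixing Windows (\r\n), Unix (\n) and old Mac (\r) line endings.
text = "first\r\nsecond\nthird\rfourth"

by_split = text.split("\n")
by_splitlines = text.splitlines()

print(by_split)       # ['first\r', 'second', 'third\rfourth']
print(by_splitlines)  # ['first', 'second', 'third', 'fourth']
```

split('\n') leaves the carriage returns in place, while splitlines()
treats all three conventions as line boundaries.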

Best wishes,
Hugh

#42Michael Paquier
michael@paquier.xyz
In reply to: Hugh Ranalli (#41)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Wed, Jan 09, 2019 at 09:52:05PM -0500, Hugh Ranalli wrote:

Thank you for putting so much effort into this. I think that looks great.
When I was doing this, I discovered that I could parse both pre- and post-
r29 versions, so I went with that, but I agree that there's probably no
good reason to do so.

OK, committed then. I have yet to study the other part of the
proposal regarding diacritical characters. Patch 3 has a conflict for
the regression tests, so a rebase would be needed. That's not a big
deal though to resolve the conflict. I am also a bit confused by the
newly-generated unaccent.rules. Why does nothing show up for the second
column (around line 414 for example)? Shouldn't we have mapping
characters?
--
Michael

#43Hugh Ranalli
hugh@whtc.ca
In reply to: Michael Paquier (#42)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Thu, 10 Jan 2019 at 01:09, Michael Paquier <michael@paquier.xyz> wrote:

OK, committed then. I have yet to study the other part of the
proposal regarding diacritical characters. Patch 3 has a conflict for
the regression tests, so a rebase would be needed. That's not a big
deal though to resolve the conflict. I am also a bit confused by the
newly-generated unaccent.rules. Why does nothing show up for the second
column (around line 414 for example)? Shouldn't we have mapping
characters?

That concerned me, as well. I have confirmed the lines are not empty. If
you open the file in a text editor (I'm using KDE's Kate), and insert a
standard character at the beginning of one of those lines, the diacritic
then appears, combined with the character you just entered. The only
program I've found that wants to display them on their own is vi (and I
only just thought of trying that).

From what I can tell, this is likely a font issue:

- http://unicode.org/faq/char_combmark.html#12b
-
https://superuser.com/questions/852901/why-are-some-combining-diacritics-shifted-to-the-right-in-some-programs
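
One way to check this without relying on a font is to ask Python's
unicodedata module about the character. This is only a sketch for
U+0300, the first of the new entries; it assumes Python 3:

```python
import unicodedata

mark = "\u0300"

print(len(mark))                    # 1: the line is not empty
print(unicodedata.name(mark))       # COMBINING GRAVE ACCENT
print(unicodedata.category(mark))   # Mn (non-spacing mark)
print(unicodedata.combining(mark))  # 230

# Placed after a base letter, the mark composes with it under NFC.
composed = unicodedata.normalize("NFC", "A" + mark)
print(ascii(composed))              # '\xc0'
```

So the rule line holds a single code point that most editors render
only once it has a base character to attach to.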

Hugh

#44Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#42)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Thomas,

On Thu, Jan 10, 2019 at 03:09:45PM +0900, Michael Paquier wrote:

OK, committed then. I have yet to study the other part of the
proposal regarding diacritical characters. Patch 3 has a conflict for
the regression tests, so a rebase would be needed. That's not a big
deal though to resolve the conflict. I am also a bit confused by the
newly-generated unaccent.rules. Why does nothing show up for the second
column (around line 414 for example)? Shouldn't we have mapping
characters?

You are registered as a reviewer and committer of the last patch of
this thread:
https://commitfest.postgresql.org/21/1924/

Are you planning to look at it or should I jump in? I have not looked
at the patch status in depth yet.
--
Michael

#45Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Michael Paquier (#44)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Mon, Jan 28, 2019 at 7:45 PM Michael Paquier <michael@paquier.xyz> wrote:

You are registered as a reviewer and committer of the last patch of
this thread:
https://commitfest.postgresql.org/21/1924/

Are you planning to look at it or should I jump in? I have not looked
at the patch status in depth yet.

Thanks for the reminder. I looked at this a couple of weeks ago when
you pinged me off-list, but I see we're still waiting for a rebase.
Hugh, can you please post a new patch? The approach looks right to me
(simply replace the composing diacritics with nothing), so if you post
a new version I'll double check with that test case I came up with
earlier, and then I'll be happy to commit it.
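
The approach can be sketched in a few lines of Python 3. The ranges are
the ones listed in the patch under review in this thread; the function
names, and the simplification of testing only the code point range (the
script also checks the general category), are assumptions made for this
example:

```python
# Ranges of combining marks considered safe to remove (from the patch).
COMBINING_MARK_RANGES = ((0x0300, 0x0362),  # Mn: accents, IPA
                         (0x20dd, 0x20e0),  # Me: symbols
                         (0x20e2, 0x20e4))  # Me: screen, keycap, triangle


def is_mark_to_remove(ch):
    """Return True if ch falls in one of the removable mark ranges."""
    codepoint = ord(ch)
    return any(begin <= codepoint <= end
               for begin, end in COMBINING_MARK_RANGES)


def strip_marks(text):
    """Replace every removable combining mark with nothing."""
    return "".join(ch for ch in text if not is_mark_to_remove(ch))


decomposed = "A\u0300"  # 'A' followed by COMBINING GRAVE ACCENT
print(strip_marks(decomposed))  # A
```

Note that this only touches the decomposed form. A precomposed U+00C0
passes through unchanged here, and is handled by its own rule in
unaccent.rules.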

--
Thomas Munro
http://www.enterprisedb.com

#46Hugh Ranalli
hugh@whtc.ca
In reply to: Thomas Munro (#45)
1 attachment(s)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Mon, 28 Jan 2019 at 02:27, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

Thanks for the reminder. I looked at this a couple of weeks ago when
you pinged me off-list, but I see we're still waiting for a rebase.
Hugh, can you please post a new patch? The approach looks right to me
(simply replace the composing diacritics with nothing), so if you post
a new version I'll double check with that test case I came up with
earlier, and then I'll be happy to commit it.

Hi Thomas,
My apologies; I hadn't realised I was supposed to do this. A rebased
version of patch 03 is attached. Let me know if you have any questions or
need any changes.

Best wishes,
Hugh

Attachments:

03-generate_unaccent_rules-remove-combining-diacritical-accents-02.patchtext/x-patch; charset=UTF-8; name=03-generate_unaccent_rules-remove-combining-diacritical-accents-02.patchDownload
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 4419a77..58b6e7d 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -61,8 +61,25 @@ PLAIN_LETTER_RANGES = ((ord('a'), ord('z')), # Latin lower case
                        (0x03b1, 0x03c9),     # GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER OMEGA
                        (0x0391, 0x03a9))     # GREEK CAPITAL LETTER ALPHA, GREEK CAPITAL LETTER OMEGA
 
+# Combining marks follow a "base" character, and result in a composite
+# character. Example: "U&'A\0300'"produces "À".There are three types of
+# combining marks: enclosing (Me), non-spacing combining (Mn), spacing
+# combining (Mc). We identify the ranges of marks we feel safe removing.
+# References:
+#   https://en.wikipedia.org/wiki/Combining_character
+#   https://www.unicode.org/charts/PDF/U0300.pdf
+#   https://www.unicode.org/charts/PDF/U20D0.pdf
+COMBINING_MARK_RANGES = ((0x0300, 0x0362),  # Mn: Accents, IPA
+                         (0x20dd, 0x20E0),  # Me: Symbols
+                         (0x20e2, 0x20e4),) # Me: Screen, keycap, triangle
+
 def print_record(codepoint, letter):
-    print (chr(codepoint) + "\t" + letter)
+    if letter:
+        output = chr(codepoint) + "\t" + letter
+    else:
+        output = chr(codepoint)
+
+    print(output)
 
 class Codepoint:
     def __init__(self, id, general_category, combining_ids):
@@ -70,6 +87,16 @@ class Codepoint:
         self.general_category = general_category
         self.combining_ids = combining_ids
 
+def is_mark_to_remove(codepoint):
+    """Return true if this is a combining mark to remove."""
+    if not is_mark(codepoint):
+        return False
+
+    for begin, end in COMBINING_MARK_RANGES:
+        if codepoint.id >= begin and codepoint.id <= end:
+            return True
+    return False
+
 def is_plain_letter(codepoint):
     """Return true if codepoint represents a "plain letter"."""
     for begin, end in PLAIN_LETTER_RANGES:
@@ -234,6 +261,8 @@ def main(args):
                              "".join(chr(combining_codepoint.id)
                                      for combining_codepoint \
                                      in get_plain_letters(codepoint, table))))
+        elif is_mark_to_remove(codepoint):
+            charactersSet.add((codepoint.id, None))
 
     # add CLDR Latin-ASCII characters
     if not args.noLigaturesExpansion:
diff --git a/contrib/unaccent/sql/unaccent.sql b/contrib/unaccent/sql/unaccent.sql
index c671827..2ae097f 100644
--- a/contrib/unaccent/sql/unaccent.sql
+++ b/contrib/unaccent/sql/unaccent.sql
@@ -9,13 +9,16 @@ SELECT unaccent('foobar');
 SELECT unaccent('ёлка');
 SELECT unaccent('ЁЖИК');
 SELECT unaccent('˃˖˗˜');
+SELECT unaccent('À');  -- Remove combining diacritical 0x0300
 
 SELECT unaccent('unaccent', 'foobar');
 SELECT unaccent('unaccent', 'ёлка');
 SELECT unaccent('unaccent', 'ЁЖИК');
 SELECT unaccent('unaccent', '˃˖˗˜');
+SELECT unaccent('unaccent', 'À');
 
 SELECT ts_lexize('unaccent', 'foobar');
 SELECT ts_lexize('unaccent', 'ёлка');
 SELECT ts_lexize('unaccent', 'ЁЖИК');
 SELECT ts_lexize('unaccent', '˃˖˗˜');
+SELECT ts_lexize('unaccent', 'À');
diff --git a/contrib/unaccent/unaccent.rules b/contrib/unaccent/unaccent.rules
index 7ce25ee..9982640 100644
--- a/contrib/unaccent/unaccent.rules
+++ b/contrib/unaccent/unaccent.rules
@@ -414,6 +414,105 @@
 ˖	+
 ˗	-
 ˜	~
+̀
+́
+̂
+̃
+̄
+̅
+̆
+̇
+̈
+̉
+̊
+̋
+̌
+̍
+̎
+̏
+̐
+̑
+̒
+̓
+̔
+̕
+̖
+̗
+̘
+̙
+̚
+̛
+̜
+̝
+̞
+̟
+̠
+̡
+̢
+̣
+̤
+̥
+̦
+̧
+̨
+̩
+̪
+̫
+̬
+̭
+̮
+̯
+̰
+̱
+̲
+̳
+̴
+̵
+̶
+̷
+̸
+̹
+̺
+̻
+̼
+̽
+̾
+̿
+̀
+́
+͂
+̓
+̈́
+ͅ
+͆
+͇
+͈
+͉
+͊
+͋
+͌
+͍
+͎
+͏
+͐
+͑
+͒
+͓
+͔
+͕
+͖
+͗
+͘
+͙
+͚
+͛
+͜
+͝
+͞
+͟
+͠
+͡
+͢
 Ά	Α
 Έ	Ε
 Ή	Η
@@ -982,6 +1081,13 @@
 ₧	Pts
 ₹	Rs
 ₺	TL
+⃝
+⃞
+⃟
+⃠
+⃢
+⃣
+⃤
 ℀	a/c
 ℁	a/s
 ℂ	C
#47Michael Paquier
michael@paquier.xyz
In reply to: Hugh Ranalli (#46)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Mon, Jan 28, 2019 at 03:26:12PM -0500, Hugh Ranalli wrote:

My apologies; I hadn't realised I was supposed to do this. A rebased
version of patch 03 is attached. Let me know if you have any questions or
need any changes.

Moved to next CF.
--
Michael

#48Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Michael Paquier (#47)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Fri, Feb 1, 2019 at 2:44 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Jan 28, 2019 at 03:26:12PM -0500, Hugh Ranalli wrote:

My apologies; I hadn't realised I was supposed to do this. A rebased
version of patch 03 is attached. Let me know if you have any questions or
need any changes.

Moved to next CF.

I checked that the script generates identical output on my machine.
Committed. Thanks!

--
Thomas Munro
http://www.enterprisedb.com

#49raam narayana
raam.soft@gmail.com
In reply to: Thomas Munro (#48)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hi,

After the latest commit in master branch, I was trying to test the python script. Ironically I still see that the output from the script is completely different from the unaccent.rules file content. Am I missing anything? My testing includes the following:

Downloaded the following files

http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt

http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml

Executed the below python script

python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules

I am using python 3.7.1 and running on Windows 10 Platform

The new status of this patch is: Needs review

#50Thomas Munro
thomas.munro@enterprisedb.com
In reply to: raam narayana (#49)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Mon, Feb 11, 2019 at 7:07 AM raam narayana <raam.soft@gmail.com> wrote:

After the latest commit in master branch, I was trying to test the python script. Ironically I still see that the output from the script is completely different from the unaccent.rules file content. Am I missing anything? My testing includes the following:

Downloaded the following files

http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt

http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml

Executed the below python script

python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt --latin-ascii-file Latin-ASCII.xml > unaccent.rules

I am using python 3.7.1 and running on Windows 10 Platform

The new status of this patch is: Needs review

Hi Raam,

How does it differ? Can you please share the output you get? I used
Python 2.7 on a Mac, exactly those input files, and my output matched
Hugh's.

--
Thomas Munro
http://www.enterprisedb.com

#51Hugh Ranalli
hugh@whtc.ca
In reply to: raam narayana (#49)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Sun, 10 Feb 2019 at 15:07, raam narayana <raam.soft@gmail.com> wrote:

Hi,

After the latest commit in master branch, I was trying to test the python
script. Ironically I still see that the output from the script is
completely different from the unaccent.rules file content. Am I missing
anything? My testing includes the following:

Downloaded the following files

http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt

http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml

Executed the below python script

python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
--latin-ascii-file Latin-ASCII.xml > unaccent.rules

I am using python 3.7.1 and running on Windows 10 Platform

The new status of this patch is: Needs review

Hi Raam,
I just ran generate_unaccent_rules.py under two environments, using the
data files given above :
- Python 3.4.3 on Linux Mint 17.3 (equivalent to Ubuntu 14.04)
- Python 3.6.7 on Ubuntu 18.04

In both cases, the output was identical to that generated by the program
under Python 2.7. So yes, more information would help. Unfortunately I
don't have a Windows Python environment readily available, but could set
one up if I had to.

Thanks,
Hugh

#52Ramanarayana
raam.soft@gmail.com
In reply to: Hugh Ranalli (#51)
1 attachment(s)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hi Hugh,

I tested the script in python 2.7 and it works perfectly. The problem is in
python 3.7 (and maybe only on Windows, as you were not getting the issue),
where I was getting the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
position 0: character maps to <undefined>

I went through the python script and found that the stdout encoding is set
to utf-8 only if python version is <=2.

I have made the same change for python version 3 as well. Please find the
patch for the same. Let me know if it makes sense.

Regards,
Ram.

On Tue, 12 Feb 2019 at 00:50, Hugh Ranalli <hugh@whtc.ca> wrote:

On Sun, 10 Feb 2019 at 15:07, raam narayana <raam.soft@gmail.com> wrote:

Hi,

After the latest commit in master branch, I was trying to test the python
script. Ironically I still see that the output from the script is
completely different from the unaccent.rules file content. Am I missing
anything? My testing includes the following:

Downloaded the following files

http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt

http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml

Executed the below python script

python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
--latin-ascii-file Latin-ASCII.xml > unaccent.rules

I am using python 3.7.1 and running on Windows 10 Platform

The new status of this patch is: Needs review

Hi Raam,
I just ran generate_unaccent_rules.py under two environments, using the
data files given above :
- Python 3.4.3 on Linux Mint 17.3 (equivalent to Ubuntu 14.04)
- Python 3.6.7 on Ubuntu 18.04

In both cases, the output was identical to that generated by the program
under Python 2.7. So yes, more information would help. Unfortunately I
don't have a Windows Python environment readily available, but could set
one up if I had to.

Thanks,
Hugh

--
Cheers
Ram 4.0

Attachments:

generate_unaccent_rules-remove-combining-diacritical-accents-03.patchapplication/octet-stream; name=generate_unaccent_rules-remove-combining-diacritical-accents-03.patchDownload
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 58b6e7d..65fe73e 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -46,7 +46,8 @@
     def bytes(source, encoding='ascii', errors='strict'):
         return source.encode(encoding=encoding, errors=errors)
 # END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
-
+else:
+	sys.stdout.reconfigure(encoding='utf-8')
 import re
 import argparse
 import sys
#53Michael Paquier
michael@paquier.xyz
In reply to: Ramanarayana (#52)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:

I tested the script in python 2.7 and it works perfectly. The problem is in
python 3.7 (and maybe only on Windows, as you were not getting the issue),
where I was getting the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
position 0: character maps to <undefined>

I went through the python script and found that the stdout encoding is set
to utf-8 only if python version is <=2.

I have made the same change for python version 3 as well. Please find the
patch for the same. Let me know if it makes sense.

Isn't that because Windows encoding becomes cp1252, utf16 or such?
FWIW, on Debian SID with Python 3.7, I get the correct output, and no
diffs on HEAD. Perhaps it would make sense to use open() on the
different files with encoding='utf-8' to avoid any kind of problems?
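
A minimal sketch of what that could look like (the helper name and the
sample file are hypothetical, not taken from the script). io.open()
accepts the encoding argument on both Python 2 and Python 3; the
TemporaryDirectory used for the demonstration is Python 3 only:

```python
import io
import os
import tempfile


def read_utf8_lines(path):
    """Read a text file as UTF-8, whatever the platform default is."""
    with io.open(path, mode="r", encoding="utf-8") as handle:
        return handle.read().splitlines()


# Demonstration with a throwaway file holding two rule-like lines.
with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = os.path.join(tmpdir, "sample.rules")
    with io.open(sample_path, mode="w", encoding="utf-8") as handle:
        handle.write("\u0100\tA\n\u0101\ta\n")
    lines = read_utf8_lines(sample_path)

print(len(lines))  # 2
```

Passing the encoding explicitly means the result no longer depends on
the locale or code page of the machine running the script.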
--
Michael

#54Ramanarayana
raam.soft@gmail.com
In reply to: Michael Paquier (#53)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hi Michael,
The issue was that the python script was working in python 2 but not in
python 3 in Windows. This is because the python script writes the final
output to stdout and stdout encoding is set to utf-8 only for python 2 but
not python 3. If no encoding is set for stdout, it takes the encoding from
the operating system. The default encoding in Linux and Windows might be
different, hence this issue.
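
The failure can be reproduced without touching stdout at all. This
sketch assumes, as Michael suggests, that the Windows default is
something like cp1252; the variable names are made up for the example:

```python
ch = "\u0100"  # LATIN CAPITAL LETTER A WITH MACRON, from the error message

# cp1252 has no mapping for U+0100, so encoding fails.
try:
    ch.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can encode any code point.
utf8_bytes = ch.encode("utf-8")

print(cp1252_ok)   # False
print(utf8_bytes)  # b'\xc4\x80'
```
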
Regards,
Ram.

On Tue, 12 Feb 2019 at 09:48, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:

I tested the script in python 2.7 and it works perfectly. The problem is in
python 3.7 (and maybe only on Windows, as you were not getting the issue),
where I was getting the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
position 0: character maps to <undefined>

I went through the python script and found that the stdout encoding is set
to utf-8 only if python version is <=2.

I have made the same change for python version 3 as well. Please find the
patch for the same. Let me know if it makes sense.

Isn't that because Windows encoding becomes cp1252, utf16 or such?
FWIW, on Debian SID with Python 3.7, I get the correct output, and no
diffs on HEAD. Perhaps it would make sense to use open() on the
different files with encoding='utf-8' to avoid any kind of problems?
--
Michael

--
Cheers
Ram 4.0

#55Hugh Ranalli
hugh@whtc.ca
In reply to: Ramanarayana (#54)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Tue, 12 Feb 2019 at 08:54, Ramanarayana <raam.soft@gmail.com> wrote:

Hi Michael,
The issue was that the python script was working in python 2 but not in
python 3 in Windows. This is because the python script writes the final
output to stdout and stdout encoding is set to utf-8 only for python 2 but
not python 3. If no encoding is set for stdout, it takes the encoding from
the operating system. The default encoding in Linux and Windows might be
different, hence this issue.
Regards,
Ram.

On Tue, 12 Feb 2019 at 09:48, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:

I tested the script in python 2.7 and it works perfectly. The problem is in
python 3.7 (and maybe only on Windows, as you were not getting the issue),
where I was getting the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
position 0: character maps to <undefined>

I went through the python script and found that the stdout encoding is set
to utf-8 only if python version is <=2.

I have made the same change for python version 3 as well. Please find the
patch for the same. Let me know if it makes sense.

Isn't that because Windows encoding becomes cp1252, utf16 or such?
FWIW, on Debian SID with Python 3.7, I get the correct output, and no
diffs on HEAD. Perhaps it would make sense to use open() on the
different files with encoding='utf-8' to avoid any kind of problems?
--
Michael

I can't look at this today, but will fire up Windows and Python tomorrow,
look at Ram's patch, and see what is going on. I'll also look at how we
open the input files, to see if we should supply an encoding. It makes
sense those input files will only make sense in UTF-8 anyway.

Ram, thanks for catching this issue.

Hugh

#56Hugh Ranalli
hugh@whtc.ca
In reply to: Ramanarayana (#52)
1 attachment(s)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Mon, 11 Feb 2019 at 15:57, Ramanarayana <raam.soft@gmail.com> wrote:

Hi Hugh,

I tested the script in python 2.7 and it works perfectly. The problem is in
python 3.7 (and maybe only on Windows, as you were not getting the issue),
where I was getting the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
position 0: character maps to <undefined>

I went through the python script and found that the stdout encoding is
set to utf-8 only if python version is <=2.

I have made the same change for python version 3 as well. Please find the
patch for the same. Let me know if it makes sense.

Regards,
Ram

Hi Ram,
I took a look at this, and unfortunately the proposed fix breaks Python 2
(sys.stdout.encoding isn't a writable attribute in Python 2) :-(. I've
attached a patch which is compatible with both versions, and have confirmed
that the output is identical across Python 2 and 3 and across both Windows
and Linux. The output on Windows and Linux is identical, once the
difference in line endings is accounted for.

I've also opened the Unicode data file in UTF-8 and added a "with" block
which ensures we close the file when we are done with it. The change makes
the Python2 compatibility a little more complex (2 blocks to remove), but
it's the cleanest I could achieve.
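
A rough sketch of the file-handling change, for illustration only. read_fields is a hypothetical name and the sample line merely follows the UnicodeData.txt layout; the real script parses more than this.

```python
import codecs
import os
import tempfile

def read_fields(path):
    """Yield the semicolon-separated fields of each line, decoded as UTF-8."""
    # The "with" block closes the file even if parsing raises.
    with codecs.open(path, mode='r', encoding='UTF-8') as data_file:
        for line in data_file:
            yield line.rstrip('\n').split(';')

# Write a one-line sample in the UnicodeData.txt layout, then read it back.
sample = (u'00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;'
          u'LATIN CAPITAL LETTER A GRAVE;;;00E0;\n')
handle, sample_path = tempfile.mkstemp()
os.close(handle)
try:
    with codecs.open(sample_path, mode='w', encoding='UTF-8') as sample_file:
        sample_file.write(sample)
    rows = list(read_fields(sample_path))
finally:
    os.remove(sample_path)
```

Passing the encoding explicitly means the result no longer depends on the platform's default codec, which is what differed between Windows and Linux.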

The attached patch goes on top of patch 02 (not on top of the broken,
committed 03). I'm hoping that's not a problem. If it is, let me know and
I'll factor out the changes.

Please let me know if you have any questions.

Best wishes,
Hugh

Attachments:

generate_unaccent_rules-remove-combining-diacritical-accents-04.patch (text/x-patch; charset=UTF-8)
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 4419a77..7a0a96e 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -32,9 +32,15 @@
 # The approach is to be Python3 compatible with Python2 "backports".
 from __future__ import print_function
 from __future__ import unicode_literals
+# END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
+
+import argparse
 import codecs
+import re
 import sys
+import xml.etree.ElementTree as ET
 
+# BEGIN: Python 2/3 compatibility - remove when Python 2 compatibility dropped
 if sys.version_info[0] <= 2:
     # Encode stdout as UTF-8, so we can just print to it
     sys.stdout = codecs.getwriter('utf8')(sys.stdout)
@@ -45,12 +51,9 @@ if sys.version_info[0] <= 2:
     # Python 2 and 3 compatible bytes call
     def bytes(source, encoding='ascii', errors='strict'):
         return source.encode(encoding=encoding, errors=errors)
+else:
 # END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
-
-import re
-import argparse
-import sys
-import xml.etree.ElementTree as ET
+    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
 
 # The ranges of Unicode characters that we consider to be "plain letters".
 # For now we are being conservative by including only Latin and Greek.  This
@@ -61,8 +64,25 @@ PLAIN_LETTER_RANGES = ((ord('a'), ord('z')), # Latin lower case
                        (0x03b1, 0x03c9),     # GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER OMEGA
                        (0x0391, 0x03a9))     # GREEK CAPITAL LETTER ALPHA, GREEK CAPITAL LETTER OMEGA
 
+# Combining marks follow a "base" character, and result in a composite
+# character. Example: "U&'A\0300'" produces "À". There are three types of
+# combining marks: enclosing (Me), non-spacing combining (Mn), spacing
+# combining (Mc). We identify the ranges of marks we feel safe removing.
+# References:
+#   https://en.wikipedia.org/wiki/Combining_character
+#   https://www.unicode.org/charts/PDF/U0300.pdf
+#   https://www.unicode.org/charts/PDF/U20D0.pdf
+COMBINING_MARK_RANGES = ((0x0300, 0x0362),  # Mn: Accents, IPA
+                         (0x20dd, 0x20E0),  # Me: Symbols
+                         (0x20e2, 0x20e4),) # Me: Screen, keycap, triangle
+
 def print_record(codepoint, letter):
-    print (chr(codepoint) + "\t" + letter)
+    if letter:
+        output = chr(codepoint) + "\t" + letter
+    else:
+        output = chr(codepoint)
+
+    print(output)
 
 class Codepoint:
     def __init__(self, id, general_category, combining_ids):
@@ -70,6 +90,16 @@ class Codepoint:
         self.general_category = general_category
         self.combining_ids = combining_ids
 
+def is_mark_to_remove(codepoint):
+    """Return true if this is a combining mark to remove."""
+    if not is_mark(codepoint):
+        return False
+
+    for begin, end in COMBINING_MARK_RANGES:
+        if codepoint.id >= begin and codepoint.id <= end:
+            return True
+    return False
+
 def is_plain_letter(codepoint):
     """Return true if codepoint represents a "plain letter"."""
     for begin, end in PLAIN_LETTER_RANGES:
@@ -206,21 +236,22 @@ def main(args):
     charactersSet = set()
 
     # read file UnicodeData.txt
-    unicodeDataFile = open(args.unicodeDataFilePath, 'r')
-
-    # read everything we need into memory
-    for line in unicodeDataFile:
-        fields = line.split(";")
-        if len(fields) > 5:
-            # http://www.unicode.org/reports/tr44/tr44-14.html#UnicodeData.txt
-            general_category = fields[2]
-            decomposition = fields[5]
-            decomposition = re.sub(decomposition_type_pattern, ' ', decomposition)
-            id = int(fields[0], 16)
-            combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
-            codepoint = Codepoint(id, general_category, combining_ids)
-            table[id] = codepoint
-            all.append(codepoint)
+    with codecs.open(
+      args.unicodeDataFilePath, mode='r', encoding='UTF-8',
+      ) as unicodeDataFile:
+        # read everything we need into memory
+        for line in unicodeDataFile:
+            fields = line.split(";")
+            if len(fields) > 5:
+                # http://www.unicode.org/reports/tr44/tr44-14.html#UnicodeData.txt
+                general_category = fields[2]
+                decomposition = fields[5]
+                decomposition = re.sub(decomposition_type_pattern, ' ', decomposition)
+                id = int(fields[0], 16)
+                combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
+                codepoint = Codepoint(id, general_category, combining_ids)
+                table[id] = codepoint
+                all.append(codepoint)
 
     # walk through all the codepoints looking for interesting mappings
     for codepoint in all:
@@ -234,6 +265,8 @@ def main(args):
                              "".join(chr(combining_codepoint.id)
                                      for combining_codepoint \
                                      in get_plain_letters(codepoint, table))))
+        elif is_mark_to_remove(codepoint):
+            charactersSet.add((codepoint.id, None))
 
     # add CLDR Latin-ASCII characters
     if not args.noLigaturesExpansion:
diff --git a/contrib/unaccent/sql/unaccent.sql b/contrib/unaccent/sql/unaccent.sql
index c671827..2ae097f 100644
--- a/contrib/unaccent/sql/unaccent.sql
+++ b/contrib/unaccent/sql/unaccent.sql
@@ -9,13 +9,16 @@ SELECT unaccent('foobar');
 SELECT unaccent('ёлка');
 SELECT unaccent('ЁЖИК');
 SELECT unaccent('˃˖˗˜');
+SELECT unaccent('À');  -- Remove combining diacritical 0x0300
 
 SELECT unaccent('unaccent', 'foobar');
 SELECT unaccent('unaccent', 'ёлка');
 SELECT unaccent('unaccent', 'ЁЖИК');
 SELECT unaccent('unaccent', '˃˖˗˜');
+SELECT unaccent('unaccent', 'À');
 
 SELECT ts_lexize('unaccent', 'foobar');
 SELECT ts_lexize('unaccent', 'ёлка');
 SELECT ts_lexize('unaccent', 'ЁЖИК');
 SELECT ts_lexize('unaccent', '˃˖˗˜');
+SELECT ts_lexize('unaccent', 'À');
diff --git a/contrib/unaccent/unaccent.rules b/contrib/unaccent/unaccent.rules
index 7ce25ee..9982640 100644
--- a/contrib/unaccent/unaccent.rules
+++ b/contrib/unaccent/unaccent.rules
@@ -414,6 +414,105 @@
 ˖	+
 ˗	-
 ˜	~
+̀
+́
+̂
+̃
+̄
+̅
+̆
+̇
+̈
+̉
+̊
+̋
+̌
+̍
+̎
+̏
+̐
+̑
+̒
+̓
+̔
+̕
+̖
+̗
+̘
+̙
+̚
+̛
+̜
+̝
+̞
+̟
+̠
+̡
+̢
+̣
+̤
+̥
+̦
+̧
+̨
+̩
+̪
+̫
+̬
+̭
+̮
+̯
+̰
+̱
+̲
+̳
+̴
+̵
+̶
+̷
+̸
+̹
+̺
+̻
+̼
+̽
+̾
+̿
+̀
+́
+͂
+̓
+̈́
+ͅ
+͆
+͇
+͈
+͉
+͊
+͋
+͌
+͍
+͎
+͏
+͐
+͑
+͒
+͓
+͔
+͕
+͖
+͗
+͘
+͙
+͚
+͛
+͜
+͝
+͞
+͟
+͠
+͡
+͢
 Ά	Α
 Έ	Ε
 Ή	Η
@@ -982,6 +1081,13 @@
 ₧	Pts
 ₹	Rs
 ₺	TL
+⃝
+⃞
+⃟
+⃠
+⃢
+⃣
+⃤
 ℀	a/c
 ℁	a/s
 ℂ	C
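
The range check the patch adds can be tried on its own. The sketch below is an approximation under stated assumptions: it uses Python's unicodedata module in place of the script's parsed UnicodeData.txt table, is_mark here is a stand-in for the helper the patch calls (its real definition is not shown in this diff), and strip_marks is a hypothetical function that mimics what the generated rules would make unaccent do.

```python
import unicodedata

# Same ranges as COMBINING_MARK_RANGES in the patch.
COMBINING_MARK_RANGES = ((0x0300, 0x0362),  # Mn: accents, IPA
                         (0x20dd, 0x20e0),  # Me: symbols
                         (0x20e2, 0x20e4))  # Me: screen, keycap, triangle

def is_mark(char):
    """Assumed stand-in: true for any combining mark (Mn, Me or Mc)."""
    return unicodedata.category(char) in ('Mn', 'Me', 'Mc')

def is_mark_to_remove(char):
    """Return true if char is a combining mark inside the safe ranges."""
    if not is_mark(char):
        return False
    return any(begin <= ord(char) <= end
               for begin, end in COMBINING_MARK_RANGES)

def strip_marks(text):
    """Hypothetical helper: drop every mark the rules file would remove."""
    return u''.join(c for c in text if not is_mark_to_remove(c))

stripped = strip_marks(u'A\u0300')
```

Marks outside the listed ranges, such as U+0363 or U+20E1, are left alone, which matches the conservative intent stated in the patch comment.
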
#57Ramanarayana
raam.soft@gmail.com
In reply to: Hugh Ranalli (#56)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

Hi Hugh,

The patch I submitted was tested in both Python 2 and 3, and it worked for
me. The single line of code added in the patch runs only in Python 3, so I
don't think it can break Python 2. I would like to see the error you got in
Python 2. Good to know the reported issue is a valid one on Windows. I tested
your patch as well, and it is also working fine.
--
Cheers
Ram 4.0

#58Michael Paquier
michael@paquier.xyz
In reply to: Ramanarayana (#57)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Sun, Feb 17, 2019 at 12:45:39PM +0530, Ramanarayana wrote:

The patch I submitted was tested in both Python 2 and 3, and it worked for
me. The single line of code added in the patch runs only in Python 3, so I
don't think it can break Python 2. I would like to see the error you got in
Python 2. Good to know the reported issue is a valid one on Windows. I tested
your patch as well, and it is also working fine.

I can see that the commit fest entry associated with this thread has
been switched back from "committed" to "Needs Review" with Thomas
Munro still associated as committer. The thing is that we have
already committed all the bits discussed here, so I am switching
the status back to "committed", which reflects the state of the thread. If
you have a set of fixes for what has been pushed regarding Windows and
Python 2/3 compatibility, I would suggest creating a new entry with
yourself as the author. Spawning a new thread would also be nice, so
that you attract the correct audience; this thread, initially about
diacritical character support for unaccent, has been used more than
enough now.

Python 2/3 support for this script is easy enough to check on Linux,
and now you are adding Windows in the mix...

Thanks,
--
Michael

#59Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#14)
Re: BUG #15548: Unaccent does not remove combining diacritical characters

On Tue, Dec 3, 2019 at 9:57 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hugh Ranalli <hugh@whtc.ca> writes:

The problem is that I downloaded the latest version of the Latin-ASCII
transliteration file (r34 rather than the r28 specified in the URL). Over 3
years ago (in r29, of course) they changed the file format (
https://unicode.org/cldr/trac/ticket/5873) so that
parse_cldr_latin_ascii_transliterator loads an empty rules set.

Ah-hah.

I'd be
happy to either a) support both formats, or b), support just the newest and
update the URL. Option b) is cleaner, and I can't imagine why anyone would
want to use an older rule set (then again, struggling with Unicode always
makes my head hurt; I am not an expert on it). Thoughts?

(b) seems sufficient to me, but perhaps someone else has a different
opinion.

Whichever we do, I think it should be a separate patch from the feature
addition for combining diacriticals, just to keep the commit history
clear.

+1 for updating to the latest file from time to time. After
http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
our special_cases() function will have just the two Cyrillic
characters, which should almost certainly be handled by adding
Cyrillic to the ranges we handle via the usual code path, and DEGREE
CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be
extracted from UnicodeData.txt (or we could just forget about them), and
then we could drop special_cases().

Aha, CLDR 36 included that change, so when we update we can drop a special case.
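
As a rough illustration of extracting the degree signs from the Unicode data rather than hard-coding them, here is a Python 3 sketch. It reads the compatibility decomposition through the standard unicodedata module instead of parsing the data file the way the script does, and compat_expansion is a hypothetical name.

```python
import unicodedata

def compat_expansion(char):
    """Return the <compat> decomposition of char as a string, or None."""
    parts = unicodedata.decomposition(char).split()
    if not parts or parts[0] != '<compat>':
        return None
    # The remaining fields are hexadecimal code points.
    return ''.join(chr(int(code, 16)) for code in parts[1:])

celsius = compat_expansion('\u2103')     # DEGREE CELSIUS
fahrenheit = compat_expansion('\u2109')  # DEGREE FAHRENHEIT
```

Whether the script should follow compatibility decompositions in general, or only for these two characters, is a separate question that this sketch does not settle.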