Unaccent extension python script Issue in Windows

Started by Ramanarayana, almost 7 years ago, 10 messages
#1 Ramanarayana
raam.soft@gmail.com

Hi Hackers,

On the master branch, the unaccent extension has an issue with the Python
script invocation below. The issue occurs only on Windows 10 with Python 3.

python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
--latin-ascii-file Latin-ASCII.xml > unaccent.rules

I am getting the following error

UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
position 0: character maps to <undefined>

I went through the Python script and found that the stdout encoding is set
to UTF-8 only if the Python version is <= 2. The same needs to be done for
Python 3.
--
Cheers
Ram 4.0
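
The error reported above can be reproduced without a Windows console. What follows is a minimal sketch, assuming the console code page was cp1252 (the report does not say which code page was active). `can_write` is a hypothetical helper and is not part of generate_unaccent_rules.py.

```python
import io


def can_write(text, encoding):
    """Return True if `text` can be written to a text stream using `encoding`."""
    # io.BytesIO stands in for the console; no real terminal is touched.
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
        return True
    except UnicodeEncodeError:
        return False


# U+0100 is the character named in the reported traceback.
cp1252_ok = can_write("\u0100", "cp1252")
utf8_ok = can_write("\u0100", "utf-8")
print(cp1252_ok, utf8_ok)
```

U+0100 has no mapping in cp1252, so the first write fails in the same way as the report, while the UTF-8 stream accepts it.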

#2 Michael Paquier
michael@paquier.xyz
In reply to: Ramanarayana (#1)
Re: Unaccent extension python script Issue in Windows

On Mon, Mar 11, 2019 at 09:54:45PM +0530, Ramanarayana wrote:

I went through the python script and found that the stdout encoding is set
to utf-8 only if python version is <=2. The same needs to be done for
python 3

If you send a patch for that, what would it look like? Could you also
register any patch you produce in the next commit fest? It is here:
https://commitfest.postgresql.org/23/
--
Michael

#3 Hugh Ranalli
hugh@whtc.ca
In reply to: Michael Paquier (#2)
Re: Unaccent extension python script Issue in Windows

On Mon, 11 Mar 2019 at 22:29, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Mar 11, 2019 at 09:54:45PM +0530, Ramanarayana wrote:

I went through the python script and found that the stdout encoding is set
to utf-8 only if python version is <=2. The same needs to be done for
python 3

If you send a patch for that, how would it look like? Could you also
register any patch produced to the future commit fest? It is here:
https://commitfest.postgresql.org/23/

We had integrated that into a patch on Bug#15548
(generate_unaccent_rules-remove-combining-diacritical-accents-04.patch),
but there had been issues as overlapping patches had already been
committed. I can try to abstract out these changes in the next few days.
Hugh

#4 Ramanarayana
raam.soft@gmail.com
In reply to: Hugh Ranalli (#3)
1 attachment(s)
Re: Unaccent extension python script Issue in Windows

Hi Hugh,

I have abstracted out the Windows compatibility changes from your patch into
a new patch and tested it. I have added the patch to
https://commitfest.postgresql.org/23/

Please feel free to change it if it requires any changes.

Cheers
Ram 4.0

Attachments:

v1_unaccent_windows_compatibility.patch (application/octet-stream)
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 58b6e7d..aad6782 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -32,9 +32,15 @@
 # The approach is to be Python3 compatible with Python2 "backports".
 from __future__ import print_function
 from __future__ import unicode_literals
+# END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
+
+import argparse
 import codecs
+import re
 import sys
+import xml.etree.ElementTree as ET
 
+# BEGIN: Python 2/3 compatibility - remove when Python 2 compatibility dropped
 if sys.version_info[0] <= 2:
     # Encode stdout as UTF-8, so we can just print to it
     sys.stdout = codecs.getwriter('utf8')(sys.stdout)
@@ -45,7 +51,9 @@ if sys.version_info[0] <= 2:
     # Python 2 and 3 compatible bytes call
     def bytes(source, encoding='ascii', errors='strict'):
         return source.encode(encoding=encoding, errors=errors)
+else:
 # END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
+    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
 
 import re
 import argparse
@@ -233,21 +241,22 @@ def main(args):
     charactersSet = set()
 
     # read file UnicodeData.txt
-    unicodeDataFile = open(args.unicodeDataFilePath, 'r')
-
-    # read everything we need into memory
-    for line in unicodeDataFile:
-        fields = line.split(";")
-        if len(fields) > 5:
-            # http://www.unicode.org/reports/tr44/tr44-14.html#UnicodeData.txt
-            general_category = fields[2]
-            decomposition = fields[5]
-            decomposition = re.sub(decomposition_type_pattern, ' ', decomposition)
-            id = int(fields[0], 16)
-            combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
-            codepoint = Codepoint(id, general_category, combining_ids)
-            table[id] = codepoint
-            all.append(codepoint)
+    with codecs.open(
+      args.unicodeDataFilePath, mode='r', encoding='UTF-8',
+      ) as unicodeDataFile:
+        # read everything we need into memory
+        for line in unicodeDataFile:
+            fields = line.split(";")
+            if len(fields) > 5:
+                # http://www.unicode.org/reports/tr44/tr44-14.html#UnicodeData.txt
+                general_category = fields[2]
+                decomposition = fields[5]
+                decomposition = re.sub(decomposition_type_pattern, ' ', decomposition)
+                id = int(fields[0], 16)
+                combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
+                codepoint = Codepoint(id, general_category, combining_ids)
+                table[id] = codepoint
+                all.append(codepoint)
 
     # walk through all the codepoints looking for interesting mappings
     for codepoint in all:
#5 Hugh Ranalli
hugh@whtc.ca
In reply to: Ramanarayana (#4)
1 attachment(s)
Re: Unaccent extension python script Issue in Windows

Hi Ram,
Thanks for doing this; I've been overestimating my ability to get to things
over the last couple of weeks.

I've looked at the patch and have made one minor change. I had moved all
the imports up to the top, to keep them in one place (I think some had
originally been used only by the Python 2 code). You added them there, but
didn't remove them from their original positions. So I've incorporated that
into your patch, attached as v2. I've tested this under Python 2 and 3 on
Linux, not Windows.

Everything else looks correct. I apologise for not having replied to your
question in the original bug report. I had intended to, but as I said,
there's been an increase in the things I need to juggle at the moment.

Best wishes,
Hugh

On Sat, 16 Mar 2019 at 22:58, Ramanarayana <raam.soft@gmail.com> wrote:

Hi Hugh,

I have abstracted out the windows compatibility changes from your patch to
a new patch and tested it. Added the patch to
https://commitfest.postgresql.org/23/

Please feel free to change it if it requires any changes.

Cheers
Ram 4.0

Attachments:

v2_unaccent_windows_compatibility.patch (text/x-patch; charset=US-ASCII)
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 58b6e7d..7a0a96e 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -32,9 +32,15 @@
 # The approach is to be Python3 compatible with Python2 "backports".
 from __future__ import print_function
 from __future__ import unicode_literals
+# END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
+
+import argparse
 import codecs
+import re
 import sys
+import xml.etree.ElementTree as ET
 
+# BEGIN: Python 2/3 compatibility - remove when Python 2 compatibility dropped
 if sys.version_info[0] <= 2:
     # Encode stdout as UTF-8, so we can just print to it
     sys.stdout = codecs.getwriter('utf8')(sys.stdout)
@@ -45,12 +51,9 @@ if sys.version_info[0] <= 2:
     # Python 2 and 3 compatible bytes call
     def bytes(source, encoding='ascii', errors='strict'):
         return source.encode(encoding=encoding, errors=errors)
+else:
 # END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
-
-import re
-import argparse
-import sys
-import xml.etree.ElementTree as ET
+    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
 
 # The ranges of Unicode characters that we consider to be "plain letters".
 # For now we are being conservative by including only Latin and Greek.  This
@@ -233,21 +236,22 @@ def main(args):
     charactersSet = set()
 
     # read file UnicodeData.txt
-    unicodeDataFile = open(args.unicodeDataFilePath, 'r')
-
-    # read everything we need into memory
-    for line in unicodeDataFile:
-        fields = line.split(";")
-        if len(fields) > 5:
-            # http://www.unicode.org/reports/tr44/tr44-14.html#UnicodeData.txt
-            general_category = fields[2]
-            decomposition = fields[5]
-            decomposition = re.sub(decomposition_type_pattern, ' ', decomposition)
-            id = int(fields[0], 16)
-            combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
-            codepoint = Codepoint(id, general_category, combining_ids)
-            table[id] = codepoint
-            all.append(codepoint)
+    with codecs.open(
+      args.unicodeDataFilePath, mode='r', encoding='UTF-8',
+      ) as unicodeDataFile:
+        # read everything we need into memory
+        for line in unicodeDataFile:
+            fields = line.split(";")
+            if len(fields) > 5:
+                # http://www.unicode.org/reports/tr44/tr44-14.html#UnicodeData.txt
+                general_category = fields[2]
+                decomposition = fields[5]
+                decomposition = re.sub(decomposition_type_pattern, ' ', decomposition)
+                id = int(fields[0], 16)
+                combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
+                codepoint = Codepoint(id, general_category, combining_ids)
+                table[id] = codepoint
+                all.append(codepoint)
 
     # walk through all the codepoints looking for interesting mappings
     for codepoint in all:
#6 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Hugh Ranalli (#5)
1 attachment(s)
Re: Unaccent extension python script Issue in Windows

Hello.

At Sun, 17 Mar 2019 20:23:05 -0400, Hugh Ranalli <hugh@whtc.ca> wrote in <CAAhbUMNoBLu7jAbyK5MK0LXEyt03PzNQt_Apkg0z9bsAjcLV4g@mail.gmail.com>

Hi Ram,
Thanks for doing this; I've been overestimating my ability to get to things
over the last couple of weeks.

I've looked at the patch and have made one minor change. I had moved all
the imports up to the top, to keep them in one place (and I think some had
originally been used only by the Python 2 code. You added them there, but
didn't remove them from their original positions. So I've incorporated that
into your patch, attached as v2. I've tested this under Python 2 and 3 on
Linux, not Windows.

Though I'm not sure about the necessity of running the script on
Windows, the problem is not specific to Windows; it is a general one
that just hasn't happened to be hit in non-Windows environments.

On CentOS7:

export LANG="ja_JP.EUCJP"
python <..snipped..>

..

UnicodeEncodeError: 'euc_jp' codec can't encode character '\xab' in position 0: illegal multibyte sequence

So this is not an issue with Windows but with python3.

The script generates identical files with both versions of
Python with the patch on Linux and Windows 7. Python 3 on Windows
emits CRLF as the newline, but that doesn't seem to do any harm. (I haven't
confirmed that yet because the build is extremely slow right now, for
reasons I haven't identified.)

This patch contains irrelevant changes. The minimal required
change would be the attached. If you want to refactor the
UnicodeData reader or rearrange the imports, that should be done in
separate patches.

It would be better to use IOBase for Python 3, especially for the stdout
replacement, but I didn't since it *is* working.

Everything else looks correct. I apologise for not having replied to your
question in the original bug report. I had intended to, but as I said,
there's been an increase in the things I need to juggle at the moment.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
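
The stdout replacement used in the attached patch can be compared with one possible reading of the IOBase remark. This is only a sketch: it assumes io.TextIOWrapper is the kind of replacement meant (the message does not name a class), io.BytesIO stands in for sys.stdout.buffer, and both helper names are hypothetical.

```python
import codecs
import io


def write_with_getwriter(text):
    """Encode through codecs.getwriter, the approach used in the patch."""
    raw = io.BytesIO()  # stand-in for sys.stdout.buffer
    out = codecs.getwriter("utf8")(raw)
    out.write(text)
    return raw.getvalue()


def write_with_textiowrapper(text):
    """Encode through io.TextIOWrapper (assumed alternative, not from the thread)."""
    raw = io.BytesIO()  # stand-in for sys.stdout.buffer
    # newline="\n" switches off newline translation inside the wrapper.
    out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
    out.write(text)
    out.flush()
    return raw.getvalue()


sample = "\u0100\tA\n"
same_bytes = write_with_getwriter(sample) == write_with_textiowrapper(sample)
print(same_bytes)
```

For this sample both wrappers produce the same UTF-8 bytes, which is consistent with the remark that the simpler codecs-based replacement is working.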

Attachments:

v3_unaccent_python3_compatibility.patch (text/x-patch; charset=us-ascii)
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 58b6e7deb7..0d645567b7 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -45,7 +45,9 @@ if sys.version_info[0] <= 2:
     # Python 2 and 3 compatible bytes call
     def bytes(source, encoding='ascii', errors='strict'):
         return source.encode(encoding=encoding, errors=errors)
+else:
 # END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
+    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
 
 import re
 import argparse
@@ -233,7 +235,8 @@ def main(args):
     charactersSet = set()
 
     # read file UnicodeData.txt
-    unicodeDataFile = open(args.unicodeDataFilePath, 'r')
+    unicodeDataFile = codecs.open(
+        args.unicodeDataFilePath, mode='r', encoding='UTF-8')
 
     # read everything we need into memory
     for line in unicodeDataFile:
#7 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#6)
Re: Unaccent extension python script Issue in Windows

Hello.

At Mon, 18 Mar 2019 14:13:34 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190318.141334.186469242.horiguchi.kyotaro@lab.ntt.co.jp>

Hello.

At Sun, 17 Mar 2019 20:23:05 -0400, Hugh Ranalli <hugh@whtc.ca> wrote in <CAAhbUMNoBLu7jAbyK5MK0LXEyt03PzNQt_Apkg0z9bsAjcLV4g@mail.gmail.com>

Hi Ram,
Thanks for doing this; I've been overestimating my ability to get to things
over the last couple of weeks.

I've looked at the patch and have made one minor change. I had moved all
the imports up to the top, to keep them in one place (and I think some had
originally been used only by the Python 2 code. You added them there, but
didn't remove them from their original positions. So I've incorporated that
into your patch, attached as v2. I've tested this under Python 2 and 3 on
Linux, not Windows.

Though I'm not sure about the necessity of running the script on
Windows, the problem is not specific to Windows; it is a general one
that just hasn't happened to be hit in non-Windows environments.

On CentOS7:

export LANG="ja_JP.EUCJP"
python <..snipped..>

..

UnicodeEncodeError: 'euc_jp' codec can't encode character '\xab' in position 0: illegal multibyte sequence

So this is not an issue with Windows but with python3.

The script generates identical files with both versions of
Python with the patch on Linux and Windows 7. Python 3 on Windows
emits CRLF as the newline, but that doesn't seem to do any harm. (I haven't
confirmed that yet because the build is extremely slow right now, for
reasons I haven't identified.)

I confirmed that CRLF actually does no harm and unaccent works
correctly (t_isspace() excludes them as white space).

This patch contains irrelevant changes. The minimal required
change would be the attached. If you want to refactor the
UnicodeData reader or rearrange the imports, that should be done in
separate patches.

It would be better to use IOBase for Python 3, especially for the stdout
replacement, but I didn't since it *is* working.

Everything else looks correct. I apologise for not having replied to your
question in the original bug report. I had intended to, but as I said,
there's been an increase in the things I need to juggle at the moment.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8 Hugh Ranalli
hugh@whtc.ca
In reply to: Kyotaro HORIGUCHI (#6)
Re: Unaccent extension python script Issue in Windows

On Mon, 18 Mar 2019 at 01:14, Kyotaro HORIGUCHI <
horiguchi.kyotaro@lab.ntt.co.jp> wrote:

This patch contains irrelevant changes. The minimal required
change would be the attached. If you want to refactor the
UnicodeData reader or rearrange the imports, that should be done in
separate patches.

I'm not sure I'd classify the second change as "irrelevant." Using "with"
is the standard and recommended practice for working with files in Python.
At the moment the script does nothing to close the open data file, whether
through regular processing or in the case of an exception. I would argue
that's a bug and should be fixed. Creating a separate patch for that seems
to be adding work for no reason.

Hugh
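
The behaviour under discussion is easy to observe. Here is a minimal sketch, using io.open (available in both Python 2 and 3) rather than the patch's codecs.open, with a temporary file standing in for UnicodeData.txt and a simulated parse failure. `closed_after_error` is a hypothetical name.

```python
import io
import os
import tempfile

# One UnicodeData.txt-style record, used only as sample input.
SAMPLE = u"0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;\n"


def closed_after_error(path):
    """Open `path` in a with-block, fail mid-loop, and report whether it was closed."""
    handle = None
    try:
        with io.open(path, mode="r", encoding="UTF-8") as data_file:
            handle = data_file
            for line in data_file:
                # Simulate an exception raised while processing a record.
                raise ValueError("simulated failure near code point " + line[:4])
    except ValueError:
        pass
    return handle.closed


fd, tmp_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(tmp_path, mode="w", encoding="UTF-8") as out:
        out.write(SAMPLE)
    was_closed = closed_after_error(tmp_path)
finally:
    os.remove(tmp_path)

print(was_closed)
```

The with-block closes the file even though the loop raised, which is the guarantee a bare open() call without a matching close() does not give.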

#9 Michael Paquier
michael@paquier.xyz
In reply to: Hugh Ranalli (#8)
Re: Unaccent extension python script Issue in Windows

On Mon, Mar 18, 2019 at 09:06:09AM -0400, Hugh Ranalli wrote:

I'm not sure I'd classify the second change as "irrelevant." Using "with"
is the standard and recommended practice for working with files in Python.

I honestly don't know about any standard way to do anything in
Python, but it is true that using "with" saves you from a forgotten
close() call.

At the moment the script does nothing to close the open data file, whether
through regular processing or in the case of an exception. I would argue
that's a bug and should be fixed. Creating a separate patch for that seems
to be adding work for no reason.

This script runs in a short-lived context, so it is really not a big
deal to not close the opened UnicodeData.txt. I agree that it is bad
practice though, so I think it's fine to fix the problem while we are at
it, given that there is another patch touching the same area of the code.
--
Michael

#10 Alvaro Herrera from 2ndQuadrant
alvherre@alvh.no-ip.org
In reply to: Michael Paquier (#9)
1 attachment(s)
Re: Unaccent extension python script Issue in Windows

Thanks! I have pushed this patch. I didn't test on Windows, but I did
verify that it works with python2 and 3 on my Linux machine.

CLDR has made release 35 already, upon download of which the script
generates a few more lines in the unaccent.rules file, as attached.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

unaccent.diff (text/x-diff; charset=utf-8)
diff --git a/contrib/unaccent/unaccent.rules b/contrib/unaccent/unaccent.rules
index 99826408ac..bf4c1bd197 100644
--- a/contrib/unaccent/unaccent.rules
+++ b/contrib/unaccent/unaccent.rules
@@ -1107,6 +1107,7 @@
 ℕ	N
 №	No
 ℗	(P)
+℘	P
 ℙ	P
 ℚ	Q
 ℛ	R
@@ -1253,6 +1254,28 @@
 ⩴	::=
 ⩵	==
 ⩶	===
+Ⱡ	L
+ⱡ	l
+Ɫ	L
+Ᵽ	P
+Ɽ	R
+ⱥ	a
+ⱦ	t
+Ⱨ	H
+ⱨ	h
+Ⱪ	K
+ⱪ	k
+Ⱬ	Z
+ⱬ	z
+Ɱ	M
+ⱱ	v
+Ⱳ	W
+ⱳ	w
+ⱴ	v
+ⱸ	e
+ⱺ	o
+Ȿ	S
+Ɀ	Z
 、	,
 。	.
 〇	0
@@ -1349,6 +1372,82 @@
 ㏝	Wb
 ㏞	V/m
 ㏟	A/m
+ꜰ	F
+ꜱ	S
+Ꜳ	AA
+ꜳ	aa
+Ꜵ	AO
+ꜵ	ao
+Ꜷ	AU
+ꜷ	au
+Ꜹ	AV
+ꜹ	av
+Ꜻ	AV
+ꜻ	av
+Ꜽ	AY
+ꜽ	ay
+Ꝁ	K
+ꝁ	k
+Ꝃ	K
+ꝃ	k
+Ꝅ	K
+ꝅ	k
+Ꝇ	L
+ꝇ	l
+Ꝉ	L
+ꝉ	l
+Ꝋ	O
+ꝋ	o
+Ꝍ	O
+ꝍ	o
+Ꝏ	OO
+ꝏ	oo
+Ꝑ	P
+ꝑ	p
+Ꝓ	P
+ꝓ	p
+Ꝕ	P
+ꝕ	p
+Ꝗ	Q
+ꝗ	q
+Ꝙ	Q
+ꝙ	q
+Ꝟ	V
+ꝟ	v
+Ꝡ	VY
+ꝡ	vy
+Ꝥ	TH
+ꝥ	th
+Ꝧ	TH
+ꝧ	th
+ꝱ	d
+ꝲ	l
+ꝳ	m
+ꝴ	n
+ꝵ	r
+ꝶ	R
+ꝷ	t
+Ꝺ	D
+ꝺ	d
+Ꝼ	F
+ꝼ	f
+Ꞇ	T
+ꞇ	t
+Ꞑ	N
+ꞑ	n
+Ꞓ	C
+ꞓ	c
+Ꞡ	G
+ꞡ	g
+Ꞣ	K
+ꞣ	k
+Ꞥ	N
+ꞥ	n
+Ꞧ	R
+ꞧ	r
+Ꞩ	S
+ꞩ	s
+Ɦ	H
 ﬀ	ff
 ﬁ	fi
 ﬂ	fl
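
Each line of the rules file pairs a source character with its replacement, separated by white space, as the added lines in the diff above show. The following is a rough sketch of consuming such lines, assuming single-character sources and exactly two fields per line. It is an illustration only and does not reflect how the unaccent C code is implemented; `load_rules` and `apply_rules` are hypothetical names.

```python
import io

# Three of the mappings added by the diff (U+2118, U+A732, U+A733),
# written with escapes so the sample does not depend on file encoding.
SAMPLE_RULES = u"\u2118\tP\n\ua732\tAA\n\ua733\taa\n"


def load_rules(stream):
    """Build a {source: replacement} dict from rules-style lines."""
    rules = {}
    for line in stream:
        # split() with no arguments also discards a trailing CR or LF.
        parts = line.split()
        if len(parts) == 2:  # assumption: ignore any other line shape
            rules[parts[0]] = parts[1]
    return rules


def apply_rules(text, rules):
    """Replace each character that has a rule; leave the rest unchanged."""
    return "".join(rules.get(ch, ch) for ch in text)


rules = load_rules(io.StringIO(SAMPLE_RULES))
result = apply_rules(u"\ua732rhus \u2118", rules)
print(result)
```

Because the line is split on any white space, a file written with CRLF line endings loads into the same dictionary as one written with LF.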