bug in Google translate snippet

Started by Alvaro Herrera almost 17 years ago | 5 messages | hackers
#1 Alvaro Herrera
alvherre@2ndquadrant.com

Hi,

I was having a look at this snippet:
http://wiki.postgresql.org/wiki/Google_Translate
and it turns out that it doesn't work if the result contains non-ASCII
chars. Does anybody know how to fix it?

alvherre=# select gtranslate('en', 'es', 'he');
ERROR: plpython: function "gtranslate" could not create return value
DETALLE: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

By adding a plpy.log() call you can see that the answer is "él":
LOG: (u'\xe9l',)

I guess it needs some treatment similar to the one in this function:
http://wiki.postgresql.org/wiki/Strip_accents_from_strings

For completeness, here is the code:

CREATE OR REPLACE FUNCTION gtranslate(src text, target text, phrase text) RETURNS text
LANGUAGE plpythonu
AS $$
import re
import urllib

import simplejson as json

class UrlOpener(urllib.FancyURLopener):
    version = "py-gtranslate/1.0"

base_uri = "http://ajax.googleapis.com/ajax/services/language/translate"
default_params = {'v': '1.0'}

def translate(src, to, phrase):
    args = default_params.copy()
    args.update({
        'langpair': '%s%%7C%s' % (src, to),
        'q': urllib.quote_plus(phrase),
    })
    argstring = '%s' % ('&'.join(['%s=%s' % (k,v) for (k,v) in args.iteritems()]))
    resp = json.load(UrlOpener().open('%s?%s' % (base_uri, argstring)))
    try:
        return resp['responseData']['translatedText']
    except:
        # should probably warn about failed translation
        return phrase

return translate(src, target, phrase)
$$;

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#2 Andrew Dunstan
andrew@dunslane.net
In reply to: Alvaro Herrera (#1)
Re: bug in Google translate snippet

Alvaro Herrera wrote:

> Hi,
>
> I was having a look at this snippet:
> http://wiki.postgresql.org/wiki/Google_Translate
> and it turns out that it doesn't work if the result contains non-ASCII
> chars. Does anybody know how to fix it?
>
> alvherre=# select gtranslate('en', 'es', 'he');
> ERROR: plpython: function "gtranslate" could not create return value
> DETALLE: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

This looks like a python issue rather than a Postgres issue. The problem
is probably in python-simplejson.

cheers

andrew

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#2)
Re: bug in Google translate snippet

Andrew Dunstan wrote:

> Alvaro Herrera wrote:
>
>> Hi,
>>
>> I was having a look at this snippet:
>> http://wiki.postgresql.org/wiki/Google_Translate
>> and it turns out that it doesn't work if the result contains non-ASCII
>> chars. Does anybody know how to fix it?
>>
>> alvherre=# select gtranslate('en', 'es', 'he');
>> ERROR: plpython: function "gtranslate" could not create return value
>> DETALLE: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>
> This looks like a python issue rather than a Postgres issue. The problem
> is probably in python-simplejson.

I think the problem happens when the PL tries to create the output
value. Otherwise I wouldn't be able to see the value in plpy.log.
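
The distinction can be illustrated in plain Python, outside PostgreSQL. This is only a sketch under an assumption the thread does not confirm: that logging needs nothing more than an ASCII-safe representation of the object, while building the return value needs a real byte encoding. It is written for Python 3, where `ascii()` plays the role that `repr()` played in the Python 2 behind plpythonu; the variable names are illustrative.

```python
# Illustrative only: plain Python 3, not PL/Python.
value = u'\xe9l'  # the translated text seen in the log, "él"

# An ASCII-safe representation always succeeds, because non-ASCII
# characters are written out as escape sequences.
loggable = ascii((value,))
print(loggable)

# Turning the same text into bytes with the 'ascii' codec is what fails.
try:
    value.encode('ascii')
    encoded_ok = True
except UnicodeEncodeError as exc:
    encoded_ok = False
    print(exc)
```

So a value that logs cleanly can still be impossible to convert when the function returns.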

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#4 Jan Urbański
wulczer@wulczer.org
In reply to: Alvaro Herrera (#3)
Re: bug in Google translate snippet

Alvaro Herrera wrote:

> Andrew Dunstan wrote:
>
>> Alvaro Herrera wrote:
>>
>>> Hi,
>>>
>>> I was having a look at this snippet:
>>> http://wiki.postgresql.org/wiki/Google_Translate
>>> and it turns out that it doesn't work if the result contains non-ASCII
>>> chars. Does anybody know how to fix it?
>>>
>>> alvherre=# select gtranslate('en', 'es', 'he');
>>> ERROR: plpython: function "gtranslate" could not create return value
>>> DETALLE: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>
>> This looks like a python issue rather than a Postgres issue. The problem
>> is probably in python-simplejson.
>
> I think the problem happens when the PL tries to create the output
> value. Otherwise I wouldn't be able to see the value in plpy.log.

The problem is that the value you are trying to return
(resp['responseData']['translatedText']) is a Unicode object, so it
cannot be output as-is. The error comes from Python complaining that you
are trying to output a non-ASCII character (u'\xe9') using the 'ascii'
codec, which cannot encode it.

One solution is to encode the Unicode string explicitly, that is, to ask
Python to convert the Unicode object into a byte string using some
codec; UTF-8 is a good choice here. For instance
    return resp['responseData']['translatedText'].encode('utf-8')
worked for me.
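
The difference between the two codecs can be reproduced outside PostgreSQL. A minimal sketch, assuming the failing implicit conversion behaves like an explicit `.encode('ascii')`, which is what the error message suggests. `translated` stands in for `resp['responseData']['translatedText']`, and `try_encode` is a helper made up for this illustration.

```python
translated = u'\xe9l'  # stand-in for resp['responseData']['translatedText']

def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

as_ascii = try_encode(translated, 'ascii')  # None: 0xE9 is outside ASCII
as_utf8 = try_encode(translated, 'utf-8')   # two bytes for the accented e, one for l

print(as_ascii)
print(repr(as_utf8))
```

UTF-8 only makes sense as the target if the database encoding is UTF8 as well; that is an assumption here, not something stated in the thread.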

See also http://docs.python.org/tutorial/introduction.html#unicode-strings
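
Applied to the function from the first message, the change touches only the code that extracts the result. The following is a sketch under assumptions, not the wiki's actual code: `extract_translation` is a hypothetical helper that takes the already-decoded JSON response, and it narrows the bare `except` to the lookup errors one would expect when `responseData` is missing or null.

```python
def extract_translation(resp, phrase):
    """Return the translated text as UTF-8 bytes, or phrase on failure.

    resp is the decoded JSON response (a dict); phrase is the original
    input, returned unchanged when no translation is present.
    """
    try:
        text = resp['responseData']['translatedText']
    except (KeyError, TypeError):
        # Missing key, or responseData is null (None) in the response.
        return phrase
    return text.encode('utf-8')

ok = extract_translation({'responseData': {'translatedText': u'\xe9l'}}, 'he')
failed = extract_translation({'responseData': None}, 'he')
```

Encoding only after the lookup succeeds keeps the fallback path untouched, so a failed translation still returns the input phrase exactly as it was passed in.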

Cheers,
Jan
