Automatic locale detection?

Started by Matthew Peter, over 19 years ago | 3 messages | general
#1 Matthew Peter
survivedsushi@yahoo.com

Is it possible to automatically detect the language encoding of incoming data? For instance, if Japanese is used, is there a way to know it is Japanese from a bit in the charset, a dictionary-based evaluation, or otherwise?


#2 Martijn van Oosterhout
kleptog@svana.org
In reply to: Matthew Peter (#1)
Re: Automatic locale detection?

On Sun, Oct 08, 2006 at 12:04:01PM -0700, Matthew Peter wrote:

> Is it possible to automatically detect the language encoding of
> incoming data? For instance if Japanese is used, is there a way to
> know it is Japanese from a bit in the charset, a dictionary-based
> evaluation or otherwise?

While technically possible, do you really want to run the risk of
getting it wrong?

Secondly, if you don't know the encoding of your data, you've got a
security problem, since you can't safely escape the data.
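The escaping concern can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: `naive_escape` is a hypothetical byte-level escaper written for this example, not the routine of any particular database or driver, and GBK is picked simply because it is a multi-byte encoding in which the backslash byte (0x5C) can appear as the second byte of a character. The same escaped bytes are safe under one reading and unsafe under another.

```python
def naive_escape(raw: bytes) -> bytes:
    """Hypothetical escaper: put a backslash in front of every single quote.

    It works on raw bytes and knows nothing about the text's encoding.
    """
    return raw.replace(b"'", b"\\'")


# Byte 0xBF followed by an ASCII single quote (0x27).
payload = b"\xbf'"

# After escaping, the bytes are BF 5C 27.
escaped = naive_escape(payload)

# Read as Latin-1, every byte is its own character, so the quote is
# still preceded by the backslash that the escaper added.
as_latin1 = escaped.decode("latin-1")

# Read as GBK, BF 5C forms a single two-byte character. The backslash
# has been absorbed into it and the quote is left unescaped.
as_gbk = escaped.decode("gbk")

print(repr(as_latin1))
print(repr(as_gbk))
```

The escaper did its job only if the consumer of the bytes agrees on the encoding, which is why the encoding has to be known rather than guessed.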

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#3 Lexington Luthor
Lexington.Luthor@gmail.com
In reply to: Matthew Peter (#1)
Re: Automatic locale detection?

Matthew Peter wrote:

> Is it possible to automatically detect the language encoding of incoming
> data? For instance if Japanese is used, is there a way to know it is
> Japanese from a bit in the charset, a dictionary-based evaluation or
> otherwise?

Have a look at http://www.mozilla.org/projects/intl/chardet.html and
http://chardet.feedparser.org/ for some implementations of this idea.

These detectors are often inaccurate, though, and sometimes fail
completely; see the warning at the bottom of
http://chardet.feedparser.org/docs/supported-encodings.html
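To see why such detection can be unreliable, here is a minimal Python sketch using only the standard library. It is not the statistical approach the linked detectors take; it is a much simpler trial-decoding check, and the function name `plausible_encodings` and the candidate list are assumptions made up for this example. It reports every candidate encoding under which the bytes decode without error, and often more than one does.

```python
# Candidate encodings to try; this list is an assumption for the example.
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]


def plausible_encodings(data, candidates=CANDIDATES):
    """Return the candidates under which `data` decodes without error.

    Decoding cleanly does not prove the encoding is the right one,
    only that it cannot be ruled out from the bytes alone.
    """
    matches = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


japanese = "日本語"

# Bytes produced as UTF-8 still list UTF-8 as a possibility.
print(plausible_encodings(japanese.encode("utf-8")))

# Bytes produced as Shift_JIS are rejected by the UTF-8 decoder.
print(plausible_encodings(japanese.encode("shift_jis")))

# Plain ASCII is valid in every candidate, so nothing can be ruled out.
print(plausible_encodings(b"hello"))
```

Short or mostly-ASCII input leaves several candidates standing, and a single-byte encoding such as Latin-1 accepts any byte sequence at all, so a check like this can only narrow the options, never confirm one.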

Regards,
LL