fuzzystrmatch module buggy? observations

Started by ERR ORRover 13 years ago2 messagesbugsgeneral
Jump to latest
#1ERR ORR
rd0002@gmail.com
bugsgeneral

The fuzzystrmatch module (
http://www.postgresql.org/docs/9.2/static/fuzzystrmatch.html) is currently,
as of 9.2.1, documented with the caution *"At present, the soundex,
metaphone, dmetaphone, and dmetaphone_alt functions do not work well with
multibyte encodings (such as UTF-8)"*.

While the venerable algorithms contained in the module *seem* to generally
work for Latin strings from European languages which all have
accented/diacritic characters such as äöüñáéíóúàèìòù, for languages with
non-Latin characters such as Kyrillic, Hebrew, Arabic, Chinese, these
venerable algorithms return NULL (empty) or plain weirdness.

Some examples:

dmetaphone ('Новости') = 'NN'
soundex ('Новости') = NULL

dmetaphone ('לפחות') = NULL
soundex ('לפחות') = NULL

soundex ('相关搜索') = NULL
dmetaphone ('相关搜索') = NULL

metaphone() crashes with SQL state: 42883 for all these strings (it tells
me I should cast the 'unknown' input).

The string 'äöüñáéíóúàèìòù' causes metaphone(), dmetaphone(),
dmetaphone_alt, soundex() to fail.

Only levenshtein() appears to function correctly with all above inputs,
even when I let it compare Hebrew against Chinese strings.

Summarizing my experience:
* for english (ASCII equivalent), the module works,
* for the rest of the Latin charsets (equivalent to ISO 8859-x) the module
works unreliably,
* for non-latin chars (UTF8 with 2-4 bytes per char) the module does not
work

Note: My DB and the OS are set up for UTF-8.

This would appear to be less a problem of Postgresql and the fuzzystrmach
module itself but because there
appear to exist no replacement algorithms adequate for a multilingual world
- at least that is my impression
after looking at the IPA and http://www.lt-world.org websites and branching
out from there.

Given all this I have no idea of this is a bug at all or the
state-of-the-art around this topic is inadequate.

Questions (to the developers):
- Is there anything in work or planned for the fuzzystrmatch module?
- Does anybody know about adequate replacements or upgrades of the soundex,
metaphone etc. algorithms from academia?

#2Bruce Momjian
bruce@momjian.us
In reply to: ERR ORR (#1)
bugsgeneral
Re: [GENERAL] fuzzystrmatch module buggy? observations

On Tue, Oct 30, 2012 at 02:29:09PM +0100, r d wrote:

The fuzzystrmatch module (http://www.postgresql.org/docs/9.2/static/
fuzzystrmatch.html) is currently, as of 9.2.1, documented with the caution "At
present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do
not work well with multibyte encodings (such as UTF-8)". 

While the venerable algorithms contained in the module seem to generally work
for Latin strings from European languages which all have accented/diacritic
characters such as äöüñáéíóúàèìòù, for languages with non-Latin characters such
as Kyrillic, Hebrew, Arabic, Chinese, these venerable algorithms return NULL
(empty) or plain weirdness. 

Some examples:

dmetaphone ('Новости') = 'NN'
soundex ('Новости') = NULL

dmetaphone ('לפחות') = NULL
soundex ('לפחות') = NULL

soundex ('相关搜索') = NULL
dmetaphone ('相关搜索') = NULL

metaphone() crashes with SQL state: 42883 for all these strings (it tells me I
should cast the 'unknown' input).

The string 'äöüñáéíóúàèìòù' causes metaphone(), dmetaphone(), dmetaphone_alt,
soundex() to fail.

Only levenshtein() appears to function correctly with all above inputs, even
when I let it compare Hebrew against Chinese strings.

Summarizing my experience:
* for english (ASCII equivalent), the module works, 
* for the rest of the Latin charsets (equivalent to ISO 8859-x) the module
works unreliably,
* for non-latin chars (UTF8 with 2-4 bytes per char) the module does not work

Note: My DB and the OS are set up for UTF-8.

This would appear to be less a problem of Postgresql and the fuzzystrmach
module itself but because there
appear to exist no replacement algorithms adequate for a multilingual world -
at least that is my impression 
after looking at the IPA and http://www.lt-world.org websites and branching out
from there.

This is a very good summary. I was not aware of all these behaviors.

Given all this I have no idea of this is a bug at all or the state-of-the-art
around this topic is inadequate.

I have no idea either.

Questions (to the developers):
- Is there anything in work or planned for the fuzzystrmatch module?
- Does anybody know about adequate replacements or upgrades of the soundex,
metaphone etc. algorithms from academia?

I have not heard of anyone working in this area. What usually happens
is some expert in the field shows up and submits a patch to improve it.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +