Searching for "bare" letters

Started by Reuven M. Lerner · 12 messages · general
#1 Reuven M. Lerner
reuven@lerner.co.il

Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon
to be upgraded to 9.1, given that we haven't yet launched).  The
project will involve numerous text fields containing English,
Spanish, and Portuguese.  Some of those text fields will be
searchable by the user.  That's easy enough to do; for our
purposes, I was planning to use some combination of LIKE searches;
the database is small enough that this doesn't take very much
time, and we don't expect the number of searchable records (or
columns within those records) to be all that large.

The thing is, the people running the site want searches to work
on what I'm calling (for lack of a better term) "bare" letters.
That is, if the user searches for "n", then the search should also
match Spanish words containing "ñ".  I'm told by Spanish-speaking
members of the team that this is how they would expect searches to
work.  However, when I just did a quick test using a UTF-8 encoded
9.0 database, I found that PostgreSQL didn't see the two
characters as identical.  (I must say, this is the behavior that I
would have expected, had the Spanish-speaking team member not said
anything on the subject.)

So my question is whether I can somehow wrangle PostgreSQL into
thinking that "n" and "ñ" are the same character for search
purposes, or if I need to do something else -- use regexps, keep a
"naked," searchable version of each column alongside the native
one, or something else entirely -- to get this to work.

Any ideas?

Thanks,

Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner
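[Editor's note: the mismatch Reuven observed is easy to reproduce outside the database. In Unicode, "ñ" is a code point distinct from "n"; only after decomposition does the "bare" letter appear. A minimal Python illustration:]

```python
import unicodedata

# "ñ" (U+00F1) is simply not equal to "n" -- this is what a plain
# equality or LIKE comparison sees.
assert "ñ" != "n"

# NFD decomposition splits "ñ" into a bare "n" followed by a
# combining tilde (U+0303), which is what accent-stripping
# approaches exploit.
decomposed = unicodedata.normalize("NFD", "ñ")
assert decomposed == "n\u0303"
assert decomposed[0] == "n"
```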

#2 planas
jslozier@gmail.com
In reply to: Reuven M. Lerner (#1)
Re: Searching for "bare" letters

Could you parse the search string for the non-English characters and
convert them to the appropriate English character? My skills are not
that good or I would offer more details.

--
Jay Lozier
jslozier@gmail.com

#3 Uwe C. Schroeder
uwe@oss4u.com
In reply to: Reuven M. Lerner (#1)
Re: Searching for "bare" letters


What kind of "client" are the users using? I assume you will have some kind
of user interface. For me this is a typical job for a user interface. The
number of letters with "equivalents" in different languages is extremely
limited, so a simple matching routine in the user interface should give you a
way to issue the proper query.

Uwe

#4 Cody Caughlan
toolbag@gmail.com
In reply to: planas (#2)
Re: Searching for "bare" letters

One approach would be to "normalize" all the text and search against that.

That is, basically convert all non-ASCII characters to their closest ASCII equivalents.

I've had to do this in Solr for searching, for the exact reasons you've outlined: treat "ñ" as "n". Ditto for "ü" -> "u", "é" -> "e", etc.

This is easily done in Solr via the included ASCIIFoldingFilterFactory:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory

You could look at the code to see how they do the conversion and implement it.
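[Editor's note: the same kind of folding can be approximated in a few lines of Python using Unicode decomposition. This is a sketch only -- Solr's filter uses a much larger hand-built table that also covers characters like "ß", which don't decompose:]

```python
import unicodedata

def fold_to_ascii(text: str) -> str:
    """Strip combining accents: 'ñ' -> 'n', 'é' -> 'e', 'ü' -> 'u'."""
    # NFD decomposition separates base letters from combining marks;
    # we then drop every combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# e.g. fold_to_ascii("señor") -> "senor", fold_to_ascii("naïve") -> "naive"
```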

/Cody


#5 Reuven M. Lerner
reuven@lerner.co.il
In reply to: Uwe C. Schroeder (#3)
Re: Searching for "bare" letters

Hi, everyone. Uwe wrote:

What kind of "client" are the users using? I assume you will have some kind
of user interface. For me this is a typical job for a user interface. The
number of letters with "equivalents" in different languages is extremely
limited, so a simple matching routine in the user interface should give you a
way to issue the proper query.

The user interface will be via a Web application. But we need to store
the data with the European characters, such as ñ, so that we can display
them appropriately.  Much as I like your suggestion, we need to do
the opposite of what you're saying -- namely, take a bare letter, and
then search for letters with accents and such on them.

I am beginning to think that storing two versions of each name, one bare
and the other not, might be the easiest way to go. But hey, I'm open
to more suggestions.
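[Editor's note: the two-column idea can be sketched end to end. The snippet below uses SQLite as a stand-in for PostgreSQL, and the table and column names are made up for illustration. The application folds each value once on insert and folds the user's search term the same way at query time:]

```python
import sqlite3
import unicodedata

def fold(text: str) -> str:
    # Lowercase and strip combining accents, so "Peña" -> "pena".
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT, name_bare TEXT)")
for name in ["Muñoz", "Peña", "Nunes"]:
    conn.execute("INSERT INTO names VALUES (?, ?)", (name, fold(name)))

# Search on the bare column, but display the original column.
rows = conn.execute(
    "SELECT name FROM names WHERE name_bare LIKE ?", ("%pen%",)
).fetchall()
# rows == [("Peña",)]
```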

Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner

#6 Noname
hamann.w@t-online.de
In reply to: Reuven M. Lerner (#1)
Re: Searching for "bare" letters

I had the same problem with German (there are ä, ö, ü).
I ended up with a normalized version of the database (for many purposes, this could
be just an extra column), plus preprocessing the input.
There is one difficulty with German searches: these letters are commonly transliterated into
"ue" etc., as in "Muenchen". So depending on culture, some people would expect a "u" search
term to match, and others the "ue". So preprocessing the query means replacing a bare "u"
(not followed by "e") with a "ue?" regex.
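[Editor's note: a sketch of that preprocessing in Python -- the exact rule set is an assumption, real German folding has more cases. Each bare vowel in the query becomes an alternation that also matches the umlaut and the vowel+e transliteration, and an explicit "ue" etc. in the query collapses to the same alternation:]

```python
import re

def german_pattern(query: str) -> str:
    # "u" should match "u", "ü", or "ue"; likewise for "a" and "o".
    # Trying "ue" before bare "u" in the alternation implements the
    # "u not followed by e" rule.
    mapping = {"a": "(a|ä|ae)", "o": "(o|ö|oe)", "u": "(u|ü|ue)"}
    return re.sub(
        r"ae|oe|ue|[aou]",
        lambda m: mapping[m.group()[0]],
        query.lower(),
    )

# The same pattern now finds all three stored spellings:
pattern = german_pattern("Muenchen")
# pattern == "m(u|ü|ue)nchen"
```

A real version would also escape regex metacharacters in the query and handle "ß"/"ss".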

BTW: if your search form does not explicitly tell the browser to use UTF-8 to encode the search field,
you might expect a small proportion of ISO-Latin-1 requests.

Regards
Wolfgang

#7 Uwe C. Schroeder
uwe@oss4u.com
In reply to: Reuven M. Lerner (#5)
Re: Searching for "bare" letters


That still doesn't hinder you from using a matching algorithm. Here's a simple
example (to my understanding of the problem):
You have texts stored in the db containing both an "n" and a "ñ". Now a client
enters "n" on the website. What you want to do is look for both variations, so
"n" translates into "n" or "ñ".
There you have it. In the routine that receives the request you have a
matching method that matches on "n" (or any of the few other characters with
equivalents), and the routine will issue a query with "xx LIKE '%n%' OR xx
LIKE '%ñ%'" (personally I would use ILIKE, since that eliminates the case
problem).

Since you're referring to a "name", I sure don't know the specifics of the
problem or data layout, but by what I know I think you can tackle this with a
rather primitive "match -> translate to" kind of algorithm.
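[Editor's note: sketched in Python -- the equivalence table is illustrative, not complete -- the "match -> translate to" routine expands each letter into its variants and emits one ILIKE pattern per combination:]

```python
# Illustrative equivalence table; extend per language as needed.
EQUIVALENTS = {
    "a": ["a", "á"], "e": ["e", "é"], "i": ["i", "í"],
    "o": ["o", "ó"], "u": ["u", "ú", "ü"], "n": ["n", "ñ"],
}

def like_patterns(term: str) -> list[str]:
    """Expand a bare search term into all accented LIKE patterns."""
    variants = [""]
    for ch in term.lower():
        options = EQUIVALENTS.get(ch, [ch])
        variants = [v + opt for v in variants for opt in options]
    return ["%" + v + "%" for v in variants]

# "pena" expands to 2*2*2 = 8 patterns, including "%peña%"; the SQL is
# then "... WHERE xx ILIKE ? OR xx ILIKE ? OR ...".
```

Note the combinatorial growth in patterns for longer terms, which is one reason normalizing the data instead can be attractive.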

One thing I'd not do: store duplicate versions. There's always a way to deal
with the data the way it is. In my opinion, storing different versions of the
same data just bloats the database when a smarter way of dealing with the
original data would do.

Uwe

#8 Oleg Bartunov
oleg@sai.msu.su
In reply to: Uwe C. Schroeder (#7)
Re: Searching for "bare" letters

I don't see the problem -- you can have a dictionary which does all the work of
recognizing bare letters and outputs several versions. Have you seen the
unaccent dictionary?

Oleg

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#9 Reuven M. Lerner
reuven@lerner.co.il
In reply to: Oleg Bartunov (#8)
Re: Searching for "bare" letters

Hi, Oleg. You wrote:

I don't see the problem - you can have a dictionary, which does all work
on recognizing bare letters and output several versions. Have you seen
unaccent
dictionary ?

This seems to be the direction that everyone is suggesting, and I'm
quite grateful for that. (I really hadn't ever needed to deal with such
issues in the past, having worked mostly with English and Hebrew, which
don't have such accent marks.)

As for the unaccent dictionary, I hadn't heard of it before, but just
saw it now in contrib, and it looks like it might fit perfectly. I'll
take a look; thanks for the suggestion.

Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner

#10 Mike Christensen
mike@kitchenpc.com
In reply to: Reuven M. Lerner (#9)
Re: Searching for "bare" letters


I wrote this code for something similar I was doing, feel free to rip
it off or copy the regular expressions:

input = Regex.Replace(input, @"[\xC0-\xC5\xE0-\xE5]", "a"); // À-Å, à-å -> "a"
input = Regex.Replace(input, @"[\xC7\xE7]", "c");           // Ç, ç -> "c"
input = Regex.Replace(input, @"[\xC8-\xCB\xE8-\xEB]", "e"); // È-Ë, è-ë -> "e"
input = Regex.Replace(input, @"[\xCC-\xCF\xEC-\xEF]", "i"); // Ì-Ï, ì-ï -> "i"
input = Regex.Replace(input, @"[\xD1\xF1]", "n");           // Ñ, ñ -> "n"
input = Regex.Replace(input, @"[\xD2-\xD6\xF2-\xF6]", "o"); // Ò-Ö, ò-ö -> "o"
input = Regex.Replace(input, @"[\xD9-\xDC\xF9-\xFC]", "u"); // Ù-Ü, ù-ü -> "u"
input = Regex.Replace(input, @"[\xDD\xFD\xFF]", "y");       // Ý, ý, ÿ -> "y"
input = Regex.Replace(input, @"\xDF", "ss");                // ß -> "ss"

#11 Eduardo Morras
nec556@retena.com
In reply to: Reuven M. Lerner (#1)
Re: Searching for "bare" letters


You can use phonetic hashing for that. There
are multiple algorithms, some of which can be tuned for specific languages.

See this documentation:

http://en.wikipedia.org/wiki/Phonetic_algorithm for a general description,

http://en.wikipedia.org/wiki/Soundex is the first one developed, very old,

http://en.wikipedia.org/wiki/Metaphone is a
family of several modern algorithms.

Remember that they are hashing algorithms; some
words can collide because they have the same pronunciation but are written differently.
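[Editor's note: as an illustration of that collision behavior, here is Soundex in Python, slightly simplified around the H/W rules:]

```python
def soundex(name: str) -> str:
    """Simplified American Soundex: same-sounding names share a code."""
    groups = ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]
    codes = {c: str(d) for d, g in enumerate(groups, start=1) for c in g}
    name = name.upper()
    result = name[0]
    prev = codes.get(name[0])
    for ch in name[1:]:
        code = codes.get(ch)
        if code is not None and code != prev:
            result += code
        if ch not in "HW":  # H/W don't separate equal codes; vowels do
            prev = code
    return (result + "000")[:4]

# "Smith" and "Smyth" collide on "S530": same pronunciation,
# different spelling.
```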

I remember that the DataparkSearch engine uses them
with dictionaries. You can check it too.

http://www.dataparksearch.org/


HTH

#12 Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Reuven M. Lerner (#9)
Re: Searching for "bare" letters

On Sun, Oct 02, 2011 at 05:45:48PM +0200, Reuven M. Lerner wrote:

quite grateful for that. (I really hadn't ever needed to deal with
such issues in the past, having worked mostly with English and
Hebrew, which don't have such accent marks.)

That isn't quite true about English. We have words like coöperate and
naïve. The former is sometimes fixed with a hyphen instead, but the
latter can't be.

I think what happened is that English speakers, because we're already
used to being sloppy (you can't tell what's a subjunctive in English,
either, just by looking), were willing to adapt our spelling to reflect
the limitations of typewriters. Also, English never really had an
official standard spelling -- by the time the English were attempting
to standardize seriously, there was already an American branch with
its own Bossypants Official Reformer of Spelling ("BORS" -- in that
case, Noah Webster; see G.B. Shaw for a British example). So
we mostly lost the accents in standard spelling. We also lost various
standard digraphs, like the one in encyclopædia (which, depending on
which branch of nonsense you subscribe to, can be spelled instead
"encyclopedia" or "encyclopaedia"; both would have been called "wrong"
once upon a time).

As for the unaccent dictionary, I hadn't heard of it before, but
just saw it now in contrib, and it looks like it might fit
perfectly. I'll take a look; thanks for the suggestion.

The big problem there is what someone else pointed to up-thread: in
some languages, the natural thing to do is to transliterate using
multiple characters. The usual example is that in German is it common
to use "e" after a vowel to approximate the umlaut. So, "ö" becomes
"oe". Unfortunately, in Swedish this is clearly a mistake, and if you
can't use the diaeresis, then you just use the "undecorated" character
instead. The famous Swedish ship called the Götheborg cannot be
transliterated as Goetheborg. Even in German, the rule is
complicated, because it's not two-way: you can't spell the famous
writer's name Göthe (even though Google seems to think you can).

As far as I can tell, the unaccent dictionary doesn't handle the
two-character case, though it sure looks like it could be extended to
do it. But it doesn't seem to have a facility for differentiating
based on the language of the string. I don't know whether that could
be added.
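[Editor's note: the language-dependent rules described above can be sketched as a folding function keyed by a language tag. The tables here are tiny illustrations, not complete rule sets:]

```python
# Per-language transliteration tables -- illustrative only.
FOLD_TABLES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},  # German: vowel + e
    "sv": {"å": "a", "ä": "a", "ö": "o"},                # Swedish: drop marks
}

def fold(text: str, lang: str) -> str:
    """Fold accented characters according to the string's language tag."""
    table = FOLD_TABLES.get(lang, {})
    return "".join(table.get(ch, ch) for ch in text.lower())

# fold("Götheborg", "sv") -> "gotheborg",
# but fold("München", "de") -> "muenchen"
```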

The upshot is that, if you need to store multilingual input and do
special handling on the strings afterwards, you are wise to store the
string with a language tag so that you can apply the right rules later
on. See RFC 5646 (http://www.rfc-editor.org/rfc/rfc5646.txt) for some
pointers. If just "stripping accents" is good enough for you, then
the unaccent dictionary will probably be good enough.

A

--
Andrew Sullivan
ajs@crankycanuck.ca