Full text search strategy for names

Started by Rick Schumeyeralmost 17 years ago7 messagesgeneral
Jump to latest
#1Rick Schumeyer
rschumeyer@gmail.com

I want to be able to search a list of articles for title words as well as
author names. I understand how to do the title words with the full text
searching. But I'm not sure the best strategy for the names. The full text
parser "parses" the names giving undesirable results.

For example,

select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1

Is there a way to tell the parser to index the words in a column without
trying to determine word roots?

Or what is the best way to index names for efficient searching?

#2Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Rick Schumeyer (#1)
Re: Full text search strategy for names

Rick Schumeyer wrote:

I want to be able to search a list of articles for title words as well
as author names.... I'm not sure the best strategy for the names. The
full text parser "parses" the names giving undesirable results.

For example,

select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1

Is there a way to tell the parser to index the words in a column without
trying to determine word roots?

Or what is the best way to index names for efficient searching?

I've got a similar question; but would love a dictionary that
could give me an efficient index that considers that
Bill/William and Bob/Rob/Robert and Khadaffi/Qaddafi might
share the same roots. Ideally it'd return exact matches
first, followed by the similar terms.

I kludged up some many step queries to try to do this; but
wonder if this would work better as a tsearch dictionary,
and wonder even more if I'm re-inventing something that's
already out there.

Anyone know of such a tsearch dictionary that is aware of
the roots of names?

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Rick Schumeyer (#1)
Re: Full text search strategy for names

Rick Schumeyer <rschumeyer@gmail.com> writes:

I want to be able to search a list of articles for title words as well as
author names. I understand how to do the title words with the full text
searching. But I'm not sure the best strategy for the names. The full text
parser "parses" the names giving undesirable results.

For example,

select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1

Er ... why is that a problem? As long as the index stems the words in
the same way that the query does, why should you care?

regards, tom lane

#4Rick Schumeyer
rschumeyer@gmail.com
In reply to: Tom Lane (#3)
Re: Full text search strategy for names

You can get extra (undesirable) results, depending on the name. For
example, if you are searching for the last name of "Ricks", you will also
find all authors whose first name is "Rick".

I also noticed that the directions for indexing multiple columns don't seem
to be quite right.

In section 12.2.2:

UPDATE pgweb SET textsearchable_index_col =
to_tsvector('english', coalesce(title,'') || coalesce(body,''));

I found that the last word of title is joined with the first word of body,
which gives strange results. I ended up added a space which gave better
results:

to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));

On Fri, Apr 17, 2009 at 1:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Rick Schumeyer <rschumeyer@gmail.com> writes:

I want to be able to search a list of articles for title words as well as
author names. I understand how to do the title words with the full text
searching. But I'm not sure the best strategy for the names. The full

text

parser "parses" the names giving undesirable results.

For example,

select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1

Er ... why is that a problem? As long as the index stems the words in
the same way that the query does, why should you care?

regards, tom lane

#5Grzegorz Jaśkiewicz
gryzman@gmail.com
In reply to: Rick Schumeyer (#4)
Re: Full text search strategy for names

so it looks to me, like you are trying to use wrong tool for the job.
Why not just normalize names to #3 ?

#6John DeSoi
desoi@pgedit.com
In reply to: Rick Schumeyer (#4)
Re: Full text search strategy for names

On Apr 17, 2009, at 7:02 AM, Rick Schumeyer wrote:

You can get extra (undesirable) results, depending on the name. For
example, if you are searching for the last name of "Ricks", you will
also find all authors whose first name is "Rick"

If you can process the names separately from the rest of the text, try

select to_tsvector('simple', 'claude Jones');

to_tsvector
----------------------
'jones':2 'claude':1

John DeSoi, Ph.D.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Rick Schumeyer (#4)
Re: Full text search strategy for names

Rick Schumeyer <rschumeyer@gmail.com> writes:

I also noticed that the directions for indexing multiple columns don't seem
to be quite right.

In section 12.2.2:

UPDATE pgweb SET textsearchable_index_col =
to_tsvector('english', coalesce(title,'') || coalesce(body,''));

Yeah, this would be better done with a space in the middle. I've
corrected the documentation --- thanks for the comment!

(Note that you could do it in other ways that don't require this,
ie feed the columns to to_tsvector() separately; but in the terms of
this particular example, you do want a space to avoid running words
together.)

regards, tom lane