Full text search strategy for names
I want to be able to search a list of articles for title words as well as
author names. I understand how to do the title words with the full text
searching. But I'm not sure the best strategy for the names. The full text
parser "parses" the names giving undesirable results.
For example,
select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1
Is there a way to tell the parser to index the words in a column without
trying to determine word roots?
Or what is the best way to index names for efficient searching?
Rick Schumeyer wrote:
I want to be able to search a list of articles for title words as well
as author names.... I'm not sure the best strategy for the names. The
full text parser "parses" the names giving undesirable results.For example,
select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1Is there a way to tell the parser to index the words in a column without
trying to determine word roots?
Or what is the best way to index names for efficient searching?
I've got a similar question; but would love a dictionary that
could give me an efficient index that considers that
Bill/William and Bob/Rob/Robert and Khadaffi/Qaddafi might
share the same roots. Ideally it'd return exact matches
first, followed by the similar terms.
I kludged up some many step queries to try to do this; but
wonder if this would work better as a tsearch dictionary,
and wonder even more if I'm re-inventing something that's
already out there.
Anyone know of such a tsearch dictionary that is aware of
the roots of names?
Rick Schumeyer <rschumeyer@gmail.com> writes:
I want to be able to search a list of articles for title words as well as
author names. I understand how to do the title words with the full text
searching. But I'm not sure the best strategy for the names. The full text
parser "parses" the names giving undesirable results.
For example,
select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1
Er ... why is that a problem? As long as the index stems the words in
the same way that the query does, why should you care?
regards, tom lane
You can get extra (undesirable) results, depending on the name. For
example, if you are searching for the last name of "Ricks", you will also
find all authors whose first name is "Rick".
I also noticed that the directions for indexing multiple columns don't seem
to be quite right.
In section 12.2.2:
UPDATE pgweb SET textsearchable_index_col =
to_tsvector('english', coalesce(title,'') || coalesce(body,''));
I found that the last word of title is joined with the first word of body,
which gives strange results. I ended up added a space which gave better
results:
to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
On Fri, Apr 17, 2009 at 1:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
Rick Schumeyer <rschumeyer@gmail.com> writes:
I want to be able to search a list of articles for title words as well as
author names. I understand how to do the title words with the full text
searching. But I'm not sure the best strategy for the names. The fulltext
parser "parses" the names giving undesirable results.
For example,
select to_tsvector('claude Jones');
to_tsvector
--------------------
'jone':2 'claud':1Er ... why is that a problem? As long as the index stems the words in
the same way that the query does, why should you care?regards, tom lane
so it looks to me, like you are trying to use wrong tool for the job.
Why not just normalize names to #3 ?
On Apr 17, 2009, at 7:02 AM, Rick Schumeyer wrote:
You can get extra (undesirable) results, depending on the name. For
example, if you are searching for the last name of "Ricks", you will
also find all authors whose first name is "Rick"
If you can process the names separately from the rest of the text, try
select to_tsvector('simple', 'claude Jones');
to_tsvector
----------------------
'jones':2 'claude':1
John DeSoi, Ph.D.
Rick Schumeyer <rschumeyer@gmail.com> writes:
I also noticed that the directions for indexing multiple columns don't seem
to be quite right.
In section 12.2.2:
UPDATE pgweb SET textsearchable_index_col =
to_tsvector('english', coalesce(title,'') || coalesce(body,''));
Yeah, this would be better done with a space in the middle. I've
corrected the documentation --- thanks for the comment!
(Note that you could do it in other ways that don't require this,
ie feed the columns to to_tsvector() separately; but in the terms of
this particular example, you do want a space to avoid running words
together.)
regards, tom lane