tsearch2 and hyphenated terms

Started by Reece Hart about 18 years ago, 4 messages, general

reece@harts.net

about 18 years ago

I'd like to use tsearch2 to index protein and gene names. Unfortunately,
such names are written inconsistently and sometimes with hyphens. For
example, MCL-1 and MCL1 are semantically equivalent but with the default
parser and to_tsvector, I see this:

unison@u8.3=> select to_tsvector('MCL1 MCL-1');
       to_tsvector
-------------------------
 '-1':3 'mcl':2 'mcl1':1

For the purposes of indexing these names, I suspect I'd get the majority
of cases by removing a hyphen when it's followed by 1 or 2 chars from
[a-zA-Z0-9]. Does that require a custom parser?
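
As a rough illustration of the rule described above (not something posted in the thread), the hyphen removal could be sketched as a pre-processing step applied to the text before it is handed to to_tsvector. Python is used purely for illustration. The function name join_short_suffix and the exact pattern are assumptions, and whether pre-processing outside the parser is acceptable depends on the application.

```python
import re

# Assumed rule, taken from the description above: drop a hyphen that sits
# between alphanumerics when it is followed by exactly 1 or 2 characters
# from [a-zA-Z0-9] and then a non-alphanumeric character or end of text.
_SHORT_SUFFIX_HYPHEN = re.compile(
    r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9]{1,2}(?![A-Za-z0-9]))"
)


def join_short_suffix(text: str) -> str:
    """Remove hyphens that precede a 1- or 2-character alphanumeric suffix."""
    return _SHORT_SUFFIX_HYPHEN.sub("", text)


if __name__ == "__main__":
    for sample in ("MCL1 MCL-1", "bcl-w and bcl-2", "well-known"):
        print(sample, "->", join_short_suffix(sample))
```

With this sketch, 'MCL1 MCL-1' becomes 'MCL1 MCL1', while 'well-known' is left alone because more than two characters follow the hyphen. A bare '-1' is also left alone, since nothing alphanumeric precedes the hyphen.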

Thanks,
Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0

Tom Lane

tgl@sss.pgh.pa.us

about 18 years ago

In reply to: Reece Hart (#1)

Re: tsearch2 and hyphenated terms

Reece Hart <reece@harts.net> writes:

For the purposes of indexing these names, I suspect I'd get the majority
of cases by removing a hyphen when it's followed by 1 or 2 chars from
[a-zA-Z0-9]. Does that require a custom parser?

Yeah, looks like it:

regression=# select * from ts_debug('MCL1 MCL-1');
   alias   |       description        | token |  dictionaries  |  dictionary  | lexemes
-----------+--------------------------+-------+----------------+--------------+---------
 numword   | Word, letters and digits | MCL1  | {simple}       | simple       | {mcl1}
 blank     | Space symbols            |       | {}             |              |
 asciiword | Word, all ASCII          | MCL   | {english_stem} | english_stem | {mcl}
 int       | Signed integer           | -1    | {simple}       | simple       | {-1}
(4 rows)

I had thought you might get a "numhword" output, but that only seems to
happen if there's at least one letter after the dash:

regression=# select * from ts_debug('MCL1 MCL-X1');
      alias      |               description                | token  |  dictionaries  |  dictionary  | lexemes
-----------------+------------------------------------------+--------+----------------+--------------+----------
 numword         | Word, letters and digits                 | MCL1   | {simple}       | simple       | {mcl1}
 blank           | Space symbols                            |        | {}             |              |
 numhword        | Hyphenated word, letters and digits      | MCL-X1 | {simple}       | simple       | {mcl-x1}
 hword_asciipart | Hyphenated word part, all ASCII          | MCL    | {english_stem} | english_stem | {mcl}
 blank           | Space symbols                            | -      | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | X1     | {simple}       | simple       | {x1}
(6 rows)

regards, tom lane

Oleg Bartunov

oleg@sai.msu.su

about 18 years ago

In reply to: Reece Hart (#1)

Re: tsearch2 and hyphenated terms

We have the same problem with names in astronomy, so we implemented
dict_regex: http://vo.astronet.ru/arxiv/dict_regex.html
Check it out!

Oleg
On Thu, 10 Apr 2008, Reece Hart wrote:

I'd like to use tsearch2 to index protein and gene names. Unfortunately,
such names are written inconsistently and sometimes with hyphens. For
example, MCL-1 and MCL1 are semantically equivalent but with the default
parser and to_tsvector, I see this:

unison@u8.3=> select to_tsvector('MCL1 MCL-1');
       to_tsvector
-------------------------
 '-1':3 'mcl':2 'mcl1':1

For the purposes of indexing these names, I suspect I'd get the majority
of cases by removing a hyphen when it's followed by 1 or 2 chars from
[a-zA-Z0-9]. Does that require a custom parser?

Thanks,
Reece

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Reece Hart

reece@harts.net

about 18 years ago

In reply to: Oleg Bartunov (#3)

Re: tsearch2 and hyphenated terms

On Fri, 2008-04-11 at 22:07 +0400, Oleg Bartunov wrote:

We have the same problem with names in astronomy, so we implemented
dict_regex http://vo.astronet.ru/arxiv/dict_regex.html
Check it out !

Oleg-

This gets me a lot closer. Thank you. I have two remaining problems.

The first problem is that 'bcl-w' and 'bcl-2' are parsed differently,
like so:

unison@u8.3=> select * from ts_debug('english','bcl-w');
      alias      |           description           | token |  dictionaries  |  dictionary  | lexemes
-----------------+---------------------------------+-------+----------------+--------------+---------
 asciihword      | Hyphenated word, all ASCII      | bcl-w | {english_stem} | english_stem | {bcl-w}
 hword_asciipart | Hyphenated word part, all ASCII | bcl   | {english_stem} | english_stem | {bcl}
 blank           | Space symbols                   | -     | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | w     | {english_stem} | english_stem | {w}

unison@u8.3=> select * from ts_debug('english','bcl-2');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | bcl   | {english_stem} | english_stem | {bcl}
 int       | Signed integer  | -2    | {simple}       | simple       | {-2}

One option would be to write a new parser or modify wparser_def.c so that
the InHyphenWordFirst state accepts p_isdigit or p_isalnum on the first
character (I think I got this right). This would achieve Tom's initial
inkling that Bcl-2 might be parsed as a numhword, and (to me) it seems
more congruent with the asciihword class.

Perhaps a more broadly useful modification is for the lexer to also emit
whitespace-delimited tokens (period). asciihword almost does the trick,
but it too requires a post-hyphen alphabetic character.

The second problem is with quantifiers in PCRE regexps. I initially
implemented a dict_regex with a conf line like
(\w+)-(\w{1,2}) $1$2
That line does not work, although I can make simpler expressions work
(e.g., (bcl)-(\w)). I think it must be related to the README caveat
regarding PCRE partial matching mode, which I didn't understand initially.

However, I don't see that it's possible to write a general regexp like
the one I initially tried. Do you have any suggestions?
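
To make the intent of that conf line concrete, here is a small sketch of what the rule is meant to do to a single token. This is not dict_regex and does not use PCRE or its partial matching mode. It only emulates the intended rewrite with Python's re module, and the name rewrite_token is hypothetical.

```python
import re

# Stand-in for the conf line "(\w+)-(\w{1,2})  $1$2" (an emulation only).
# re.ASCII restricts \w to [a-zA-Z0-9_].
RULE = re.compile(r"(\w+)-(\w{1,2})", re.ASCII)


def rewrite_token(token: str) -> str:
    """Join the two halves when the whole token matches the rule,
    then lowercase; otherwise just lowercase the token."""
    match = RULE.fullmatch(token)
    if match:
        return (match.group(1) + match.group(2)).lower()
    return token.lower()


if __name__ == "__main__":
    for token in ("bcl-w", "bcl-2", "MCL-1", "well-known"):
        print(token, "->", rewrite_token(token))
```

A token-level rule like this only helps if the hyphenated name arrives as one token. In the ts_debug output above that is the case for bcl-w (asciihword) but not for bcl-2, which the parser splits into bcl and -2 before any dictionary sees it.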

Thanks again. I'm very impressed with tsearch2.

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0