HTML tags and tsearch2

Started by Joanna Sharman almost 18 years ago | 2 messages | general

Joanna.Sharman@ed.ac.uk

almost 18 years ago

Hi,

I have recently started experimenting with tsearch2 and it seems that
the default behaviour is to ignore HTML tags and treat them as
word-separators. What I would like it to do is to ignore HTML tags
within words, but instead of creating separate words, combine the
characters separated by the tag into one word.

For example: in the database I have words like 'K<sub>ir</sub>' that
need to be searched using the term without HTML tags, i.e. 'Kir'.
Currently, the HTML tags are ignored and two words are stored in the
vector, 'k' and 'ir'. I would like only one word, 'kir', to be stored
in the vector, so that searches using the word 'kir' will match the row.

A second, related question is whether it is possible to cause tsearch2
to split up words when it encounters digits, e.g. 'TM8' into 'TM' and
'8'.

I am not sure if this functionality is possible to implement using
tsearch2 or if there might be a better way, so I would be grateful for
any advice or pointers to further reading on how I might do this. (I
am using PostgreSQL version 8.1.10)

Many thanks in advance,
Joanna

Oleg Bartunov

oleg@sai.msu.su

almost 18 years ago

In reply to: Joanna Sharman (#1)

Re: HTML tags and tsearch2

On Thu, 26 Jun 2008, Joanna Sharman wrote:

> Hi,
>
> I have recently started experimenting with tsearch2 and it seems that the
> default behaviour is to ignore HTML tags and treat them as word-separators.
> What I would like it to do is to ignore HTML tags within words, but instead
> of creating separate words, combine the characters separated by the tag into
> one word.
>
> For example: in the database I have words like 'K<sub>ir</sub>' that need to
> be searched using the term without HTML tags, i.e. 'Kir'. Currently, the HTML
> tags are ignored and two words are stored in the vector, 'k' and 'ir'. I
> would like only one word, 'kir', to be stored in the vector, so that searches
> using the word 'kir' will match the row.

Two options: write your own parser that treats HTML tags the way you want, or preprocess the text (strip the tags) before calling to_tsvector.
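
A minimal sketch of the preprocessing option, done on the application side in Python before the text is handed to to_tsvector. The function name strip_tags_for_search and the list of inline tags are assumptions for illustration, not part of tsearch2. Inline tags such as <sub> are removed without leaving a gap, so 'K<sub>ir</sub>' becomes 'Kir'; every other tag is replaced by a space so that neighbouring words are not glued together.

```python
import re

# Hypothetical list of inline tags that may occur inside a word.
INLINE_TAGS = ("sub", "sup", "i", "b", "em", "strong")

# Opening or closing inline tag, e.g. <sub>, </sub>, <i class="x">.
_INLINE_RE = re.compile(
    r"</?(?:%s)\b[^>]*>" % "|".join(INLINE_TAGS), re.IGNORECASE
)
# Any remaining tag.
_ANY_TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_for_search(text):
    """Return text with HTML tags removed, ready for to_tsvector."""
    # Remove inline tags without inserting a separator: K<sub>ir</sub> -> Kir
    text = _INLINE_RE.sub("", text)
    # Replace all other tags with a space so adjacent words stay separate.
    text = _ANY_TAG_RE.sub(" ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


print(strip_tags_for_search("K<sub>ir</sub> channel"))
```

The cleaned string could then be stored in a separate column and indexed in place of the raw HTML. A regular expression is not a real HTML parser, so this only suits simple, well-formed markup.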

> A second, related question is whether it is possible to cause tsearch2 to
> split up words when it encounters digits, e.g. 'TM8' into 'TM' and '8'.

You can write your own dictionary or use dict_regex from
http://vo.astronet.ru/arxiv/dict_regex.html
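
For comparison, the same splitting can be sketched as a preprocessing step in Python. This is neither a tsearch2 dictionary nor dict_regex; it is a swapped-in alternative that rewrites the text before to_tsvector sees it. The function name split_letters_digits is hypothetical, and only ASCII letters are handled.

```python
import re

# Zero-width boundary between an ASCII letter and a digit, in either order.
_BOUNDARY_RE = re.compile(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])")


def split_letters_digits(text):
    """Insert a space wherever a letter run meets a digit run."""
    return _BOUNDARY_RE.sub(" ", text)


print(split_letters_digits("TM8"))
```

Indexing both the original and the split form would let searches for 'TM8' and for 'TM' match the same row.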

> I am not sure if this functionality is possible to implement using tsearch2
> or if there might be a better way, so I would be grateful for any advice or
> pointers to further reading on how I might do this. (I am using PostgreSQL
> version 8.1.10)

Think about upgrading to 8.3, where full-text search is built into the core server.

> Many thanks in advance,
> Joanna

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83