using Tsearch2 for chemical text
Hi, I have a table with about 9M entries. The table has 2 fields: id
and name which are of serial and text types respectively. I have a
ordinary index on the text field which allows me to do searches in
reasonable time. Most of my searches are of the form
select * from mytable where name ~ 'some text query'
I know that the Tsearch2 module will let me have very efficient text
searches. But if I understand correctly, it's based on a language
specific dictionary.
My problem is that the name column contains names of chemicals. Now
for many cases this may simply be a number (1674-56-2) and in other
cases it may be an alphanumeric string (such as (-)O-acetylcarnitine
or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word
(say viagra or calcium chloride or pentathol).
My question is: will Tsearch2 be able to handle this type of text? Or
will it be hampered by the fact that the bulk of the rows do not
correspond to ordinary English
-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
My Ethicator machine must have had a built-in moral
compromise spectral phantasmatron! I'm a genius."
-Calvin
Rajarshi Guha <rguha@indiana.edu> writes:
My problem is that the name column contains names of chemicals. Now
for many cases this may simply be a number (1674-56-2) and in other
cases it may be an alphanumeric string (such as (-)O-acetylcarnitine
or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word
(say viagra or calcium chloride or pentathol).
My question is: will Tsearch2 be able to handle this type of text?
I think you might need to write a custom lexer to divide the strings
into meaningful units. If there are subsections of these names that
make sense to search for, then tsearch2 can certainly handle the
mechanics of that, but I doubt that the standard rules will divide
these names into lexemes usefully.
regards, tom lane
Tsearch2 is used for full text indexing. It won't be any faster than a
btree index like the one you have now (I assume it's unique -- if it
isn't then I think it ought to be). If you cluster the table by your
index it will speed up your queries.
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
owner@postgresql.org] On Behalf Of Rajarshi Guha
Sent: Wednesday, July 25, 2007 3:41 PM
To: pgsql-general
Subject: [GENERAL] using Tsearch2 for chemical textHi, I have a table with about 9M entries. The table has 2 fields: id
and name which are of serial and text types respectively. I have a
ordinary index on the text field which allows me to do searches in
reasonable time. Most of my searches are of the formselect * from mytable where name ~ 'some text query'
I know that the Tsearch2 module will let me have very efficient text
searches. But if I understand correctly, it's based on a language
specific dictionary.My problem is that the name column contains names of chemicals. Now
for many cases this may simply be a number (1674-56-2) and in other
cases it may be an alphanumeric string (such as (-)O-acetylcarnitine
or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word
(say viagra or calcium chloride or pentathol).My question is: will Tsearch2 be able to handle this type of text? Or
will it be hampered by the fact that the bulk of the rows do not
correspond to ordinary English-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
My Ethicator machine must have had a built-in moral
compromise spectral phantasmatron! I'm a genius."
-Calvin---------------------------(end of
broadcast)---------------------------
Show quoted text
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
Rajarshi Guha <rguha@indiana.edu> writes:
My problem is that the name column contains names of chemicals. Now
for many cases this may simply be a number (1674-56-2) and in other
cases it may be an alphanumeric string (such as (-)O-acetylcarnitine
or 1,2-cis-dihydroxybenzoate). In some cases it is a well-known word
(say viagra or calcium chloride or pentathol).My question is: will Tsearch2 be able to handle this type of text?
I think you might need to write a custom lexer to divide the strings
into meaningful units. If there are subsections of these names that
make sense to search for, then tsearch2 can certainly handle the
mechanics of that, but I doubt that the standard rules will divide
these names into lexemes usefully.regards, tom lane
We have similar problem since Japanese is an agglutinative
language. To solve the problem, we divide Japanese texts into space
separted "words" by using specialized tool, which has huge dictionary
to look for word boundaries. To make things easier, I have written a
simple C function which calls the tool and returns the space separated
texts.
Just for your information.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
I think you might need to write a custom lexer to divide the strings
into meaningful units. If there are subsections of these names that
make sense to search for, then tsearch2 can certainly handle the
mechanics of that, but I doubt that the standard rules will divide
these names into lexemes usefully.
A custom lexer for tsearch2 that recognized chemistry related lexical
components (di-, tetra-, acetyl-, ethan-, -oic, -ane, -ene etc) would
increase *hugely* the out-of-the-box applicability of PostgreSQL to
scientific applications. Perhaps such an effort could be co ordinated
with a physics based lexer and biology related lexer, to perhaps provide
a unified lexer that provided full scientific capabilities in the way
that PostGIS provides unified geospatial capabilities.
I don't know how best to bring such an effort about, but I do know that
if such a thing were created it would be a boon for PostgreSQL, giving
it a very significant leg up in terms of functionality, not to mention
the great positive impact that the wide, free availability of such a
tool would have on the scientific research community.
On Wed, 25 Jul 2007, Rajarshi Guha wrote:
Hi, I have a table with about 9M entries. The table has 2 fields: id and name
which are of serial and text types respectively. I have a ordinary index on
the text field which allows me to do searches in reasonable time. Most of my
searches are of the formselect * from mytable where name ~ 'some text query'
I know that the Tsearch2 module will let me have very efficient text
searches. But if I understand correctly, it's based on a language specific
dictionary.
wrong ! it comes with some written human language dictionaries, but you can
write your very own dictionaries. dictionary is just a C-program.
My problem is that the name column contains names of chemicals. Now for many
cases this may simply be a number (1674-56-2) and in other cases it may be an
alphanumeric string (such as (-)O-acetylcarnitine or
1,2-cis-dihydroxybenzoate). In some cases it is a well-known word (say viagra
or calcium chloride or pentathol).My question is: will Tsearch2 be able to handle this type of text? Or will it
be hampered by the fact that the bulk of the rows do not correspond to
ordinary English
Oh, sure. See, for example, our dict_regex dictionary, we use for
astronomical search.
http://lynx.sao.ru/~karpov/software/postgres_dict_regex.html
This is a work in progress, but it works.
-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
My Ethicator machine must have had a built-in moral
compromise spectral phantasmatron! I'm a genius."
-Calvin---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Naz, in posted link to the dict_regex dictionary for tsearch2
http://lynx.sao.ru/~karpov/software/postgres_dict_regex.html
Feel free to test it and send us feedback. It's rather general, of course,
it uses regex (pcre library).
Oleg
On Thu, 26 Jul 2007, Naz Gassiep wrote:
I think you might need to write a custom lexer to divide the strings
into meaningful units. If there are subsections of these names that
make sense to search for, then tsearch2 can certainly handle the
mechanics of that, but I doubt that the standard rules will divide
these names into lexemes usefully.A custom lexer for tsearch2 that recognized chemistry related lexical
components (di-, tetra-, acetyl-, ethan-, -oic, -ane, -ene etc) would
increase *hugely* the out-of-the-box applicability of PostgreSQL to
scientific applications. Perhaps such an effort could be co ordinated with a
physics based lexer and biology related lexer, to perhaps provide a unified
lexer that provided full scientific capabilities in the way that PostGIS
provides unified geospatial capabilities.I don't know how best to bring such an effort about, but I do know that if
such a thing were created it would be a boon for PostgreSQL, giving it a very
significant leg up in terms of functionality, not to mention the great
positive impact that the wide, free availability of such a tool would have on
the scientific research community.---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83