tsearch2 dictionary that indexes substrings?

Started by Tilmann Singeralmost 19 years ago7 messagesgeneral
Jump to latest
#1Tilmann Singer
tils-pgsql@tils.net

Hi,

In this http://archive.netbsd.se/?ml=pgsql-hackers&a=2006-10&t=2475028
thread Teodor Sigaev describes a way to make tsearch2 index substrings
of words:

Brain storm method:

Develop a dictionary which returns all substring for lexeme, for
example for word foobar it will be 'foobar fooba foob foo fo oobar
ooba oob oo obar oba ob bar ba ar'

Did anyone do that already?

If I understand it correctly such a dictionary would require to write
a custom C component - is that correct? Or could I get away with
writing a plpgsql function that does the above and hooking that
somehow into the tsearch2 config?

thanks, Til

#2PFC
lists@peufeu.com
In reply to: Tilmann Singer (#1)
Re: tsearch2 dictionary that indexes substrings?

You want trigram based search.
ie.

postgresql -> 'pos', 'ost', 'stg', 'tgr', 'gre', 'res', 'esq', 'sql'

searching for 'gresq' is searching for 'gre' and 'res' and 'esq' which
is good friends with bitmap scan. Then a little LIKE '%gresq%' to filter
the results.

PS : indexing all substring means for long words you get huge number of
lexems...

On Fri, 20 Apr 2007 11:06:26 +0200, Tilmann Singer <tils-pgsql@tils.net>
wrote:

Show quoted text

Hi,

In this http://archive.netbsd.se/?ml=pgsql-hackers&amp;a=2006-10&amp;t=2475028
thread Teodor Sigaev describes a way to make tsearch2 index substrings
of words:

Brain storm method:

Develop a dictionary which returns all substring for lexeme, for
example for word foobar it will be 'foobar fooba foob foo fo oobar
ooba oob oo obar oba ob bar ba ar'

Did anyone do that already?

If I understand it correctly such a dictionary would require to write
a custom C component - is that correct? Or could I get away with
writing a plpgsql function that does the above and hooking that
somehow into the tsearch2 config?

thanks, Til

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

#3Oleg Bartunov
oleg@sai.msu.su
In reply to: Tilmann Singer (#1)
Re: tsearch2 dictionary that indexes substrings?

On Fri, 20 Apr 2007, Tilmann Singer wrote:

Hi,

In this http://archive.netbsd.se/?ml=pgsql-hackers&amp;a=2006-10&amp;t=2475028
thread Teodor Sigaev describes a way to make tsearch2 index substrings
of words:

Brain storm method:

Develop a dictionary which returns all substring for lexeme, for
example for word foobar it will be 'foobar fooba foob foo fo oobar
ooba oob oo obar oba ob bar ba ar'

Did anyone do that already?

AFAIK, no

If I understand it correctly such a dictionary would require to write
a custom C component - is that correct? Or could I get away with
writing a plpgsql function that does the above and hooking that
somehow into the tsearch2 config?

You need to write C-function, see example in
http://www.sai.msu.su/~megera/postgres/fts/doc/fts-intdict-xmp.html

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#4Tilmann Singer
tils-pgsql@tils.net
In reply to: PFC (#2)
Re: tsearch2 dictionary that indexes substrings?

* Listmail <lists@peufeu.com> [20070420 11:25]:

You want trigram based search.
ie.

postgresql -> 'pos', 'ost', 'stg', 'tgr', 'gre', 'res',
'esq', 'sql'

searching for 'gresq' is searching for 'gre' and 'res' and
'esq' which is good friends with bitmap scan. Then a little LIKE
'%gresq%' to filter the results.

I'm not sure how that would fit in with tsearch2 to do full text
search so that I can do queries like

select * from content where plainto_tsquery(:q) @@ to_tsvector(body)

If the possible substrings were already indexed like Oleg suggested in
his reply through writing a custom C dictionary, a query like above
with q='foo' would find rows from the table content where body
contains 'foobar' for instance.

However I've seen the example to create a trigram index on a unique
word list to provide alternative spelling suggestions to the user
which looked very useful.

PS : indexing all substring means for long words you get huge
number of lexems...

I'm aware of that and in my case I don't think it will be a
problem. It is for a type-ahead search web interface so actually it
only requires indexing all possible substrings starting from char 1,
ie. p, po, pos, post, postg, postgr, postgre, postgres, postgresq,
postgresql.

Til

#5PFC
lists@peufeu.com
In reply to: Tilmann Singer (#4)
Re: tsearch2 dictionary that indexes substrings?

I'm aware of that and in my case I don't think it will be a
problem. It is for a type-ahead search web interface so actually it
only requires indexing all possible substrings starting from char 1,
ie. p, po, pos, post, postg, postgr, postgre, postgres, postgresq,
postgresql.

If you want to provide autocompletion, you could build a list of unique
lexemes from inserted posts, and use a btree index and LIKE...

#6Tilmann Singer
tils-pgsql@tils.net
In reply to: Oleg Bartunov (#3)
Re: tsearch2 dictionary that indexes substrings?

* Oleg Bartunov <oleg@sai.msu.su> [20070420 11:32]:

If I understand it correctly such a dictionary would require to write
a custom C component - is that correct? Or could I get away with
writing a plpgsql function that does the above and hooking that
somehow into the tsearch2 config?

You need to write C-function, see example in
http://www.sai.msu.su/~megera/postgres/fts/doc/fts-intdict-xmp.html

Thanks.

My colleague who speaks more C than me came up with the code below
which works fine for us. Will the memory allocated for lexeme be freed
by the caller?

Til

/*
* Dictionary for partials of a word, ie. foo => {f,fo,foo}
*
* Based on the tsearch2/gendict/config.sh generator
*
* Author: Sean Treadway
*
* This code is released under the terms of the PostgreSQL License.
*/
#include "postgres.h"

#include "dict.h"
#include "common.h"

#include "subinclude.h"
#include "ts_locale.h"

#define is_utf8_continuation(c) ((unsigned char)(c) >= 0x80 && (unsigned char)(c) <= 0xBF)

PG_FUNCTION_INFO_V1(dlexize_partial);
Datum dlexize_partial(PG_FUNCTION_ARGS);
Datum
dlexize_partial(PG_FUNCTION_ARGS) {
char* in = (char*)PG_GETARG_POINTER(1);

char* utxt = pnstrdup(in, PG_GETARG_INT32(2)); /* palloc */
char* txt = lowerstr(utxt); /* palloc */
int txt_len = strlen(txt);

int results = 0;
int i = 0;

/* may overallocate, that's ok */
TSLexeme *res = palloc(sizeof(TSLexeme)*(txt_len+1));

for (i = 1; i <= txt_len; i++) {
/* skip UTF8 control codes until EOS */
if (!is_utf8_continuation(txt[i])) {
res[results++].lexeme = pnstrdup(txt, i);
}
}

res[results].lexeme=NULL;

pfree(utxt);
pfree(txt);

/* Receiver must free res memory and res[].lexeme */
PG_RETURN_POINTER(res);
}

#7Teodor Sigaev
teodor@sigaev.ru
In reply to: Tilmann Singer (#6)
Re: tsearch2 dictionary that indexes substrings?

My colleague who speaks more C than me came up with the code below
which works fine for us. Will the memory allocated for lexeme be freed

Nice, except self-defined utf8 properties. I think it will be much better to use
pg_mblen(char*). In this case your dictionary will work with any supported by
pgsql encodings.

by the caller?

Yes, of course.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/