What is the simplest text search configuration?

Started by Jérôme Etévé, over 16 years ago | 5 messages | general

Jérôme Etévé

jerome.eteve@gmail.com

over 16 years ago

Hi all,

I'd like to implement full-text search with PostgreSQL, and I can't find
a text search configuration that would just:

- map accented Unicode letters to their unaccented equivalents
- tokenize the words (and skip any non-word characters)
- use no stopwords
- lowercase the tokens

How can I achieve this? I'm particularly interested in deactivating
the stopwords filtering.

I tried pg_catalog.simple, but despite its name, it still considers stop words.

Thanks for your help!

Jerome.

--
Jerome Eteve.
http://www.eteve.net
jerome@eteve.net

m.nacos@gmail.com

over 16 years ago

In reply to: Jérôme Etévé (#1)

Re: What is the simplest text search configuration?

Dear Jerome,

From personal experience, full-text searching in PostgreSQL can be quite
powerful, but it's not simple: it requires thought, planning and coding.
PostgreSQL mainly provides an efficient token-matching mechanism
supporting positional information and weights, but its natural language
processing and normalization are pretty basic.

If you don't mind writing a couple of user-defined functions to take
control of lexeme normalization, then tsvector/tsquery support can be a
very powerful tool for custom search engines.
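
As a rough illustration of the kind of user-defined function mentioned
here, the sketch below wraps to_tsvector in a SQL function that lowercases
the input and maps a handful of accented letters to plain ASCII before
tokenizing. The function name normalized_tsvector and the deliberately
incomplete character list are assumptions made for this example; neither
comes from the thread.

```sql
-- Hypothetical helper: normalize the text ourselves, then let the
-- built-in 'simple' configuration do the tokenizing.
CREATE FUNCTION normalized_tsvector(text) RETURNS tsvector
LANGUAGE sql IMMUTABLE AS $$
    SELECT to_tsvector(
        'pg_catalog.simple'::regconfig,
        -- lower() first, then replace each accented letter with the
        -- letter at the same position in the second string.
        translate(lower($1),
                  'àâäéèêëîïôöùûüç',
                  'aaaeeeeiioouuuc')
    );
$$;

-- Example call; the same expression could also back an index.
SELECT normalized_tsvector('Crème brûlée');
```

Note that lower() on non-ASCII letters depends on the database locale, so
a real function would probably need a longer mapping or a different
approach to accents.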

regards,

Michael

Jérôme Etévé

jerome.eteve@gmail.com

over 16 years ago

In reply to: Michael Nacos (#2)

Re: What is the simplest text search configuration?

Hi Michael,

I actually found that the 'simple' dictionary doesn't enforce a
stopword list by default, so I defined my search configuration like this
and it works:

create text search configuration sbsimple ( parser = 'default' );
alter text search configuration sbsimple alter mapping for
    word, hword, asciiword, asciihword with simple;
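
One way to check that stop words survive is to run a short sample through
to_tsvector. The sketch below uses the built-in simple configuration so
that it runs on any installation; substituting 'sbsimple' should behave
the same way for plain words, assuming that configuration has been created
as shown above. The sample sentence is made up for the example.

```sql
-- "The" would be dropped by a configuration with an English stopword
-- list; here it is kept and lowercased like every other token.
SELECT to_tsvector('simple', 'The Quick brown fox');
-- expected: 'brown':3 'fox':4 'quick':2 'the':1

-- ts_debug shows, token by token, which dictionaries were consulted
-- and which lexemes they produced.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('simple', 'The Quick brown fox');
```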

Cheers!

J.

tgl@sss.pgh.pa.us

over 16 years ago

In reply to: Jérôme Etévé (#1)

Re: What is the simplest text search configuration?

Jérôme Etévé <jerome.eteve@gmail.com> writes:

I'd like to implement a full text search with postgresql, and I can't find
a text search configuration that would just:

map unicode accentuated letters to an un-accentuated equivalent
tokenize the words (and skip any non word characters)
no stopwords
lower case the tokens

How can I achieve this? I'm particularly interested in deactivating
the stopwords filtering.

I tried pg_catalog.simple, but despite its name, it still considers stop words.

What's wrong with specifying an empty stopword list?

(To me, removing accents is already past what I'd expect of a "simple"
configuration, so I doubt you're going to find a dictionary that
provides exactly that set of features and no other ones.)
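
A minimal sketch of the empty-stopword-list idea: a dictionary based on
the simple template, created without the STOPWORDS option, discards
nothing. The names simple_nostop and nostop_cfg are hypothetical, chosen
only for this example.

```sql
-- Dictionary built on the simple template. Leaving out the STOPWORDS
-- option means no stopword file is loaded.
CREATE TEXT SEARCH DICTIONARY simple_nostop (
    TEMPLATE = pg_catalog.simple
);

-- Hypothetical configuration that routes word tokens to that dictionary.
CREATE TEXT SEARCH CONFIGURATION nostop_cfg (COPY = pg_catalog.simple);
ALTER TEXT SEARCH CONFIGURATION nostop_cfg
    ALTER MAPPING FOR asciiword, word, asciihword, hword
    WITH simple_nostop;

-- ts_lexize tests a single dictionary in isolation; a stop word would
-- come back as an empty array, a kept word as its lowercased lexeme.
SELECT ts_lexize('simple_nostop', 'The');
-- expected: {the}
```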

regards, tom lane

oleg@sai.msu.su

over 16 years ago

In reply to: Jérôme Etévé (#3)

Re: What is the simplest text search configuration?

We submitted an unaccent dictionary for 8.5.
See http://www.sai.msu.su/~megera/wiki/unaccent for some information.
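
The unaccent dictionary covers the accent-stripping requirement from the
original question. A minimal sketch, assuming a server where the module is
installable as an extension (the CREATE EXTENSION syntax is newer than
this thread) and using a made-up configuration name, simple_unaccent:

```sql
-- Assumes PostgreSQL 9.1 or later with the contrib modules available.
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Hypothetical configuration name; starts as a copy of the built-in one.
CREATE TEXT SEARCH CONFIGURATION simple_unaccent (COPY = pg_catalog.simple);

-- unaccent is a filtering dictionary: it strips accents and hands the
-- result to the next dictionary in the list, here simple, which
-- lowercases the token and applies no stopword list.
ALTER TEXT SEARCH CONFIGURATION simple_unaccent
    ALTER MAPPING FOR asciiword, word, asciihword, hword,
                      hword_part, hword_asciipart
    WITH unaccent, simple;

-- Sample input; the accented and capitalized words should come out as
-- plain lowercase lexemes, with "de" and "la" still present.
SELECT to_tsvector('simple_unaccent', 'Hôtels de la Mer');
```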

Oleg

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83