How to drop all tokens that a snowball dictionary cannot stem?

Started by Christoph Gößmann, over 6 years ago, 4 messages, general
#1 Christoph Gößmann
mail@goessmann.io

Hi everybody,

I am trying to get all the lexemes for a text using to_tsvector(). But I want only words that english_stem -- the integrated snowball dictionary -- is able to handle to show up in the final tsvector. Since snowball dictionaries only remove stop words, but keep the words that they cannot stem, I don't see an easy option to do this. Do you have any ideas?

I went ahead with creating a new configuration:

-- add new configuration english_led
CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY = pg_catalog.english);

-- dropping any words that contain numbers already in the parser
ALTER TEXT SEARCH CONFIGURATION english_led
DROP MAPPING FOR numword;

EXAMPLE:

SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt somejnk words');
to_tsvector
--------------------------------------------------
'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7

In this tsvector, I would like 'somejnk' and 'tt' not to be included.
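To see how each token is classified and which dictionary produces its lexeme, ts_debug() can be run against the same configuration. This is only a diagnostic sketch; it assumes the english_led configuration created above exists in the current database.

```sql
-- Show, per token: the parser's token type (alias), the dictionaries
-- consulted for that type, the dictionary that accepted the token,
-- and the resulting lexemes.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english_led', 'A test sentence with ui44 \tt somejnk words');
```

Token types without a mapping (numword here, after the DROP MAPPING) should show no dictionary and no lexemes, which is why ui44 does not appear in the tsvector.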

Many thanks,
Christoph

#2 Jeff Janes
jeff.janes@gmail.com
In reply to: Christoph Gößmann (#1)
Re: How to drop all tokens that a snowball dictionary cannot stem?

On Fri, Nov 22, 2019 at 8:02 AM Christoph Gößmann <mail@goessmann.io> wrote:

In this tsvector, I would like 'somejnk' and 'tt' not to be included.

I don't think the question is well defined. It will happily stem
'somejnking' to 'somejnk'. Doesn't that mean that it **can** handle it?
The fact that 'somejnk' itself wasn't altered during stemming doesn't mean
it wasn't handled, just as 'test' wasn't altered during stemming.
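A quick way to check this is to call the dictionary directly with ts_lexize(). The following is a small sketch using the built-in english_stem dictionary; the input words are just the ones from this thread.

```sql
-- The stemmer applies suffix rules to any input; it has no notion of a
-- "known" word, so made-up words are accepted like real ones.
SELECT ts_lexize('english_stem', 'somejnking');  -- suffix stripped
SELECT ts_lexize('english_stem', 'somejnk');     -- nothing to strip
SELECT ts_lexize('english_stem', 'test');        -- nothing to strip

-- A stop word is the only case that produces no lexeme (an empty array).
SELECT ts_lexize('english_stem', 'with');
```

A dictionary signals "I don't recognize this token" by returning NULL, and a snowball dictionary does not do that for ordinary words, so there is nothing in its output to filter on.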

Cheers,

Jeff

#3 Christoph Gößmann
mail@goessmann.io
In reply to: Jeff Janes (#2)
Re: How to drop all tokens that a snowball dictionary cannot stem?

Hi Jeff,

You're right about that point. Let me redefine. I would like to drop all tokens that are neither the stemmed nor the unstemmed version of a known word. Would it be possible to put a word list as a filter ahead of the stemming? Or do you know of a good English lexeme list that could be used to filter after stemming?

Thanks,
Christoph

#4 Jeff Janes
jeff.janes@gmail.com
In reply to: Christoph Gößmann (#3)
Re: How to drop all tokens that a snowball dictionary cannot stem?

On Sat, Nov 23, 2019 at 10:42 AM Christoph Gößmann <mail@goessmann.io>
wrote:

Hi Jeff,

You're right about that point. Let me redefine. I would like to drop all
tokens that are neither the stemmed nor the unstemmed version of a known
word. Would it be possible to put a word list as a filter ahead of the
stemming? Or do you know of a good English lexeme list that could be used
to filter after stemming?

I think what you describe is the opposite of what snowball was designed to
do. You want an ispell-based dictionary instead.

PostgreSQL doesn't ship with real ispell dictionaries, so you have to
retrieve the files yourself and install them into $SHAREDIR/tsearch_data, as
described in the docs:
https://www.postgresql.org/docs/12/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
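As a rough sketch of what that could look like once the files are in place: the dictionary name english_ispell and the file names english.dict and english.affix are assumptions for illustration, so substitute whatever files you actually install.

```sql
-- Assumes english.dict and english.affix (hypothetical names) have been
-- copied into $SHAREDIR/tsearch_data; english.stop ships with PostgreSQL.
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = english,
    AffFile   = english,
    StopWords = english
);

-- Map the word-like token types to the ispell dictionary ONLY.
-- A token that no dictionary in the list recognizes is discarded, so
-- leaving english_stem out of the list is what drops unknown words.
ALTER TEXT SEARCH CONFIGURATION english_led
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_ispell;

-- Re-run the example; 'somejnk' and 'tt' should no longer appear,
-- provided they are absent from the installed word list.
SELECT to_tsvector('english_led', 'A test sentence with ui44 \tt somejnk words');
```

Note that an ispell dictionary normalizes to dictionary base forms rather than snowball stems, so the surviving lexemes will likely differ from the earlier output (for example 'sentence' rather than 'sentenc').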

Cheers,

Jeff