Clarification of the "simple" dictionary

Started by Andreas Joseph Krogh, almost 16 years ago | 7 messages | general

andreak@officenet.no

almost 16 years ago

Hi. It's not clear to me whether the "simple" dictionary uses stopwords
or not. Does it?
Can someone please post a complete description of what the "simple"
dict. does?

John Gage

jsmgage@numericable.fr

almost 16 years ago

In reply to: Andreas Joseph Krogh (#1)

Re: Clarification of the "simple" dictionary

The easiest way to look at this is to give the simple dictionary a
document with to_tsvector() and see if stopwords pop out.

In my experience they do: the simple dictionary just splits the
document into its words, as separated by spaces and similar
delimiters. It doesn't analyze further.

John

On Jul 22, 2010, at 4:15 PM, Andreas Joseph Krogh wrote:

Hi. It's not clear to me if the "simple" dictionary uses stopwords
or not, does it?
Can someone please post a complete description of what the "simple"
dict. does?

-- 
Andreas Joseph Krogh <andreak@officenet.no>
Senior Software Developer / CTO
OfficeNet AS
Rosenholmveien 25
1414 Trollåsen
NORWAY
Tlf:    +47 24 15 38 90
Fax:    +47 24 15 38 91
Mobile: +47 909 56 963

The most difficult thing in the world is to know how to do a thing and
to watch somebody else doing it wrong, without comment.

Andreas Joseph Krogh

andreak@officenet.no

almost 16 years ago

In reply to: John Gage (#2)

Re: Clarification of the "simple" dictionary

On 07/22/2010 06:27 PM, John Gage wrote:

The easiest way to look at this is to give the simple dictionary a
document with to_tsvector() and see if stopwords pop out.

In my experience they do. In my experience, the simple dictionary
just breaks the document down into the space etc. separated words in
the document. It doesn't analyze further.

That's my experience too; I just want to make sure it doesn't actually
have any stopwords that I've missed. Trying many phrases and checking
for stopwords doesn't really prove anything.

Can anybody confirm that the "simple" dict. only lowercases the words
and "uniques" them?

Oleg Bartunov

oleg@sai.msu.su

almost 16 years ago

In reply to: Andreas Joseph Krogh (#3)

Re: Clarification of the "simple" dictionary

Don't guess, read the docs:
http://www.postgresql.org/docs/8.4/interactive/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY

12.6.2. Simple Dictionary

The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme. Alternatively, the dictionary can be configured to report non-stop-words as unrecognized, allowing them to be passed on to the next dictionary in the list.

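d=# \dFd+ simple
                                    List of text search dictionaries
   Schema   |  Name  |     Template      | Init options |                        Description
------------+--------+-------------------+--------------+-----------------------------------------------------------
 pg_catalog | simple | pg_catalog.simple |              | simple dictionary: just lower case and check for stopword

By default it has no Init options, so it doesn't check for stopwords.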
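
A minimal sketch of how this can be checked from psql, assuming a stock installation whose tsearch_data directory contains english.stop. The dictionary name simple_english is hypothetical and not part of PostgreSQL:

```sql
-- Built-in "simple" dictionary: no stop word file is configured,
-- so every token is lower-cased and kept.
SELECT ts_lexize('simple', 'The');             -- {the}
SELECT to_tsvector('simple', 'The Quick brown fox and THE dog');

-- Hypothetical dictionary built from the same template, this time with
-- a stop word file. Assumes english.stop exists in $SHAREDIR/tsearch_data.
CREATE TEXT SEARCH DICTIONARY simple_english (
    TEMPLATE  = pg_catalog.simple,
    STOPWORDS = english
);

SELECT ts_lexize('simple_english', 'The');     -- {}  (stop word, discarded)
SELECT ts_lexize('simple_english', 'Quick');   -- {quick}
```

The second dictionary leaves the built-in "simple" untouched, so the two behaviours can be compared side by side.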

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Andreas Joseph Krogh

andreak@officenet.no

almost 16 years ago

In reply to: Oleg Bartunov (#4)

Re: Clarification of the "simple" dictionary

On 07/22/2010 07:44 PM, Oleg Bartunov wrote:

Don't guess, but read docs
http://www.postgresql.org/docs/8.4/interactive/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY

12.6.2. Simple Dictionary

The simple dictionary template operates by converting the input token
to lower case and checking it against a file of stop words. If it is
found in the file then an empty array is returned, causing the token
to be discarded. If not, the lower-cased form of the word is returned
as the normalized lexeme. Alternatively, the dictionary can be
configured to report non-stop-words as unrecognized, allowing them to
be passed on to the next dictionary in the list.

d=# \dFd+ simple
                                    List of text search dictionaries
   Schema   |  Name  |     Template      | Init options |                        Description
------------+--------+-------------------+--------------+-----------------------------------------------------------
 pg_catalog | simple | pg_catalog.simple |              | simple dictionary: just lower case and check for stopword

By default it has no Init options, so it doesn't check for stopwords.

Guess what - I *have* read the docs, which say "...and checking it
against a file of stop words". What was unclear to me was whether it is
configured with a stopwords file by default, which I understand from
your reply is not the case. Very good, fits my needs like a glove :-)
It might be worth considering updating the docs to make this clearer?

So - can we rely on "simple" to remain this way forever (no Init
options) or is it better to make a copy of it with the same properties
as today?

It seems "simple" + the unaccent dict. available in 9.0 saves my day,
thanks Mr. Bartunov.
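
A rough sketch of that combination, assuming PostgreSQL 9.1 or later where contrib modules are installed with CREATE EXTENSION (on 9.0 the contrib SQL script for unaccent would be loaded instead). The configuration name simple_unaccent is hypothetical:

```sql
-- Install the unaccent contrib module (9.1+ syntax).
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Hypothetical configuration, started as a copy of the built-in "simple" one.
CREATE TEXT SEARCH CONFIGURATION simple_unaccent (COPY = pg_catalog.simple);

-- unaccent is a filtering dictionary: it strips accents and passes the
-- result on to the next dictionary in the list, here "simple".
ALTER TEXT SEARCH CONFIGURATION simple_unaccent
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, simple;

-- Accented and mixed-case words are normalized; nothing is dropped as a stop word.
SELECT to_tsvector('simple_unaccent', 'Hôtel CAFÉ and the rest');
```

Only the token types that can contain non-ASCII letters are remapped here; the other types keep the mapping copied from "simple".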

Oleg Bartunov

oleg@sai.msu.su

almost 16 years ago

In reply to: Andreas Joseph Krogh (#5)

Re: Clarification of the "simple" dictionary

Andreas,

I'd create my own copy of the dictionary, to be independent of system changes.
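
One way such a copy might look, as a sketch only; the names my_simple and my_simple_cfg are hypothetical:

```sql
-- Private dictionary with the same properties as today's built-in "simple":
-- same template, no Init options, hence no stop word file.
CREATE TEXT SEARCH DICTIONARY my_simple (
    TEMPLATE = pg_catalog.simple
);

-- Hypothetical configuration that starts as a copy of the built-in one ...
CREATE TEXT SEARCH CONFIGURATION my_simple_cfg (COPY = pg_catalog.simple);

-- ... and routes the word-like token types through the private dictionary.
ALTER TEXT SEARCH CONFIGURATION my_simple_cfg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH my_simple;

SELECT to_tsvector('my_simple_cfg', 'Some Mixed CASE words');
```

Token types not listed in the ALTER MAPPING (numbers, URLs and so on) still go through the built-in dictionary and would need the same treatment for full independence.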

Oleg

John Gage

jsmgage@numericable.fr

almost 16 years ago

In reply to: Oleg Bartunov (#4)

Re: Clarification of the "simple" dictionary

By default it has no Init options, so it doesn't check for stopwords.

In the first place, this functionality is a rip-snorting home run for
Postgres. I congratulate Oleg, who I believe is one of the authors.

In the second, I too had not read (carefully) the documentation and am
very happy to find that I can eliminate stop words with 'simple'.
That will be a tremendous convenience going forward.

It turns out that using 'english' and getting stemmed lexemes is
extremely convenient too, but this functionality in 'simple' is
excellent.

Thanks,

John