contrib/tsearch
Hi Oleg/Teodor,
I'm sorry to keep posting bugs without patches, but I'm just hoping you guys
know the answer faster than I...I know you're busy.
What does tsearch have against the word 'herring' (as in the fish)? Why is
it considered a stopword?
Attached are example queries...
Chris
Hmmm...thinking about it, maybe 'herring' is being reduced to 'her' after
the stemming process and hence is thought to be a stopword? This is a bug,
but how should it be fixed?
Although, tests don't support that:
usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx ## 'himring';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)
usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx ## 'hisring';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)
usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx ## 'hising';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)
usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx ## 'himing';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)
All work...?
Chris
On Thu, 5 Sep 2002, Christopher Kings-Lynne wrote:
Hmmm...thinking about it, maybe 'herring' is being reduced to 'her' after
the stemming process and hence is thought to be a stopword? This is a bug,
but how should it be fixed?
How to use stop words is a difficult question. We'll see what we can
do. Porter's stemming algorithm probably has a problem here:
'herring' -> 'her'~'ring'
(I have a demo of an English-Russian stemmer, so you can play with it.)
http://intra.astronet.ru/db/lingua/snowball/
I'll ask Martin Porter whether there could be an error in the stemmer.
But I think the problem is in the concept of using stop words.
Should we check for stop words before stemming or after?
In the first case we have to collect all forms of the stop words, which is doable
but difficult to maintain; in the latter case we have the current problem.
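The two orderings can be sketched in a few lines of C. This is only an illustration under stated assumptions: `toy_stem` imitates one small part of a Porter-style stemmer (strip a trailing "ing", then undouble a final double consonant other than l, s or z), which is enough to turn 'herring' into 'her'. It is not the stemmer tsearch ships, and the stop list and the names `is_stop`, `kept_if_checked_before` and `kept_if_checked_after` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stop list, not the list shipped with tsearch. */
static const char *const stop_words[] = { "her", "his", "the", "and" };

static int is_stop(const char *w)
{
    size_t i;

    for (i = 0; i < sizeof(stop_words) / sizeof(stop_words[0]); i++)
        if (strcmp(w, stop_words[i]) == 0)
            return 1;
    return 0;
}

static int is_vowel(char c)
{
    return c != '\0' && strchr("aeiou", c) != NULL;
}

/*
 * Toy stemmer (assumption, NOT the full Porter algorithm): if the word ends
 * in "ing" and the rest contains a vowel, drop "ing", then undouble a final
 * double consonant unless it is l, s or z.
 */
static void toy_stem(const char *in, char *out, size_t outsz)
{
    size_t len = strlen(in);
    size_t i;
    int stem_has_vowel = 0;

    assert(outsz > 0);
    if (len > 3 && strcmp(in + len - 3, "ing") == 0)
    {
        for (i = 0; i < len - 3; i++)
            if (is_vowel(in[i]))
                stem_has_vowel = 1;
        if (stem_has_vowel)
        {
            len -= 3;
            if (len >= 2 && in[len - 1] == in[len - 2] &&
                !is_vowel(in[len - 1]) && strchr("lsz", in[len - 1]) == NULL)
                len--;
        }
    }
    if (len >= outsz)
        len = outsz - 1;
    memcpy(out, in, len);
    out[len] = '\0';
}

/* Stop check on the raw word: returns 1 if the word would be indexed. */
int kept_if_checked_before(const char *word)
{
    return !is_stop(word);
}

/* Stop check on the stem: returns 1 if the word would be indexed. */
int kept_if_checked_after(const char *word)
{
    char stem[64];

    toy_stem(word, stem, sizeof(stem));
    return !is_stop(stem);
}
```

With this sketch, `kept_if_checked_before("herring")` returns 1 (the word is indexed), while `kept_if_checked_after("herring")` returns 0, because the stem 'her' is in the stop list.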
It's time for beta1 and I'm not sure we can work on this issue
right now, but I feel big pressure from tsearch users :-)
If people want to help us, why not work on a stop word list that includes
all forms? In any case, we are not native English speakers, so don't expect us
to create a decent list. The programming changes are trivial; for the moment
we'll probably just end up using a compile-time option.
As always, your patches are welcome!
BTW, you can test your queries much more easily:
list=# select 'herring'::mquery_txt;
ERROR: Your query contained only stopword(s), ignored
list=# select 'herring'::query_txt;
query_txt
-----------
'herring'
(1 row)
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Oleg,
The Porter stemmer stems 'herring' and 'herrings' to 'her', which is a bit
unfortunate. A quick fix is to put 'herring/herrings' in the exception list
in the english (porter2) stemmer, but I'll look at this case over the next
few days and see if I can come up with something a bit better.
Interesting that no one has reported this before.
Martin
On Thu, 5 Sep 2002, Martin Porter wrote:
Oleg,
The Porter stemming stems herring and herrings to her, which is a bit
unfortunate. A quick fix is to put 'herring/herrings' in the exception list
in the english (porter2) stemmer, but I'll look at this case over the next
few days and see if I can come up with something a bit better.
Unfortunately, we wrote the tsearch module before the Snowball project started,
so we used an implementation we found on the net (www.muscat.com), and
it has no exception list. OpenFTS uses Snowball stemming, so we'd like
to have a fix. I think we have enough arguments to use the Snowball stemmers
in tsearch as well.
Interesting that no one has reported this before.
:-) Thanks to Christopher for his persistence.
Martin
Regards,
Oleg
Should we check for stop words before stemming or after ?
I think you should.
In the first case we have to collect all forms of stop-words
which is doable
but difficult to maintain, in latter - we'll have current problem.
Looking at the list of stopwords you sent me, Oleg, only about 1 of the
120 stopwords needs to have all word forms added. I also don't think it'll
be a maintenance problem, because stopwords in general don't have different
word forms.
E.g. her, his, i, and, etc. don't have different forms. In fact, the
_only_ entry in the stopword list that needs a different form is 'yourself' /
'yourselves'. Actually, according to dictionary.com 'ourself' is also a word.
'themself' isn't tho. Some others I don't know about are:
'veri' - I assume this is stemmed 'very', so why not just use 'very'?
So, why don't you change tsearch to check for stop words _before_ stemming?
I can give you a list of revised stopwords that haven't been stemmed, with
all forms of the words.
It's time for beta1 and I'm not sure if we could work on this issue
right now, but I feel a big pressure from tsearch users :-)
If people want to help us why not to work on stop words list including
all forms ? In any case, we are not native english, so don't expect we'll
create more or less decent list. Programming changes are trivial, probably
we'll end for the moment just using compile time option.
As always, your patches are welcome !
I'm happy to work on the list of stopwords for you, Oleg. I agree this
might be a 7.4 thing though...
Chris
Looking at the list of stopwords you sent me, Oleg, there are only about 1
out of the list of 120 stopwords that need to have all word forms added. I
also don't think it'll be a maintenance problem. The reason I think this is
because stopwords in general don't have different word forms.
Actually, it just occurred to me that stuff like:
will
won't
it
it's
where
where's
Will all have to be in the list, right?
Chris
There also seems to be a more complete list of english stopwords here:
http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/
However this list again does not include contractions. I can take this
list, check it and submit it to you Oleg, but do you want me to add
contractions?
eg. wasn't, isn't, it's, etc.?
Chris
On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:
There also seems to be a more complete list of english stopwords here:
http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/
Chris, I think we have to separate the stop word list from the tsearch package and
supply just some defaults. The reason is to let the user decide what is
a stop word: different domains should have different stop words.
This is how OpenFTS works.
Also, we probably need to let the user decide when to check for stop words,
before or after stemming. I'm waiting for Martin's fix for the English stemmer,
and we'll probably switch to the Snowball stemmers, which are of better quality.
Damn, we wanted to do these things and much more a bit later, because we're under
big pressure from our work. We'll see if we can manage our plans.
We certainly need developers to help us with full text searching and
ltree (it has a chance to support XML). We also need to work
on adding concurrency support to GiST.
So, I can't promise we'll work on tsearch right now, but we provide
makedict.pl so you can build a dictionary with a custom list of stop words.
Did you try it?
However this list again does not include contractions. I can take this
list, check it and submit it to you Oleg, but do you want me to add
contractions?
eg. wasn't, isn't, it's, etc.?
Hmm, our parser isn't smart enough to handle them as a single word, so
it won't help:
13:30:03[megera@amon]~/app/fts/test-suite>./testdict.pl -p
wasn't
lexeme:wasn:1:Latin word
lexeme:':12:Space symbols
lexeme:t:1:Latin word
But you could always add 'wasn', 'isn' ... and 't', 's' to your list of
stop words and be happy. Hmm, we could probably enhance our parser to
handle such words too.
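A parser enhancement of this kind could look roughly like the sketch below. It is not the tsearch or OpenFTS parser: `next_word` and its `join_apostrophes` flag are hypothetical names, and only ASCII letters count as word characters. With the flag off, "wasn't" splits into 'wasn' and 't', as in the testdict.pl output above; with it on, an apostrophe between two letters stays inside the word.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Copy the next word found at or after *pos into out and advance *pos
 * past it. Returns 1 if a word was found, 0 at end of input.
 * If join_apostrophes is nonzero, an apostrophe that is directly followed
 * by a letter is kept as part of the current word.
 */
int next_word(const char *s, size_t *pos, int join_apostrophes,
              char *out, size_t outsz)
{
    size_t i = *pos;
    size_t n = 0;

    assert(s != NULL && pos != NULL && out != NULL && outsz > 0);

    /* Skip everything that cannot start a word. */
    while (s[i] != '\0' && !isalpha((unsigned char) s[i]))
        i++;
    if (s[i] == '\0')
    {
        *pos = i;
        out[0] = '\0';
        return 0;
    }

    for (;;)
    {
        if (isalpha((unsigned char) s[i]))
            ;                   /* ordinary word character */
        else if (join_apostrophes && s[i] == '\'' &&
                 isalpha((unsigned char) s[i + 1]))
            ;                   /* apostrophe inside a word */
        else
            break;
        if (n + 1 < outsz)      /* silently truncate overlong words */
            out[n++] = s[i];
        i++;
    }
    out[n] = '\0';
    *pos = i;
    return 1;
}
```

Calling `next_word` in a loop until it returns 0 yields the words of a string one by one; whether contractions then match stop word entries such as "wasn't" depends on the stop list containing those forms.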
Anyway, most problems are just a question of time we don't have :-(
Regards,
Oleg
On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:
Looking at the list of stopwords you sent me, Oleg, there are only about 1
out of the list of 120 stopwords that need to have all word forms added. I
also don't think it'll be a maintenance problem. The reason I think this is
because stopwords in general don't have different word forms.
Actually, it just occurred to me that stuff like:
will
won't
it
it's
where
where's
Will all have to be in the list, right?
Right, see my previous message. Teodor is our main developer; he should be
back from vacation very soon. But he already has many assignments regarding
our main project. Is there a smart programmer out there?
Chris
Regards,
Oleg
On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:
Should we check for stop words before stemming or after ?
I think you should.
In the first case we have to collect all forms of stop-words
which is doable
but difficult to maintain, in latter - we'll have current problem.
Looking at the list of stopwords you sent me, Oleg, there are only about 1
out of the list of 120 stopwords that need to have all word forms added. I
also don't think it'll be a maintenance problem. The reason I think this is
because stopwords in general don't have different word forms.
eg. her, his, i, and, etc. They don't have different forms. In fact, the
_only_ word in the stopword list that needs a different form is yourself and
yourselves. Actually, according to dictionary.com 'ourself' is also a word.
'themself' isn't tho. Some others I don't know about are:
'veri' - I assume this is stemmed 'very', so why not just use 'very'?
That's because we currently check for stop words after stemming, and
I think Porter's algorithm converts 'very' to 'veri' :-)
So, why don't you change tsearch to check for stop words _before_ stemming?
I can give you a list of revised stopwords that haven't been stemmed, with
all forms of the words.
I agree that the English list is probably easy to maintain, but what about
other languages? We don't have any volunteers; you're the first one.
It's time for beta1 and I'm not sure if we could work on this issue
right now, but I feel a big pressure from tsearch users :-)
If people want to help us why not to work on stop words list including
all forms ? In any case, we are not native english, so don't expect we'll
create more or less decent list. Programming changes are trivial, probably
we'll end for the moment just using compile time option.
As always, your patches are welcome !
I'm happy to work on the list of stopwords for you, Oleg. I agree this
might be 7.4 thing though...
We could always keep updates separately on our page and in CVS.
Chris
Regards,
Oleg
Should we check for stop words before stemming or after ?
The current implementation supports both variants. Look at the dictionary interface
definition in morph.c:
typedef struct
{
    char        localename[NAMEDATALEN];
    /* init dictionary */
    void       *(*init) (void);
    /* close dictionary */
    void        (*close) (void *);
    /* find in dictionary */
    char       *(*lemmatize) (void *, char *, int *);
    int         (*is_stoplemm) (void *, char *, int);
    int         (*is_stemstoplemm) (void *, char *, int);
} DICT;
The 'is_stoplemm' method is called before 'lemmatize', and 'is_stemstoplemm' after it.
At the end of dict/porter_english.dct:
TABLE_DICT_START
    "C",
    setup_english_stemmer,
    closedown_english_stemmer,
    engstemming,
    NULL,
    is_stopengword
TABLE_DICT_END
dict/russian_stemming.dct:
TABLE_DICT_START
    "ru_RU.KOI8-R",
    NULL,
    NULL,
    ru_RUKOI8R_stem,
    ru_RUKOI8R_is_stopword,
    NULL
TABLE_DICT_END
So the English stemmer decides whether a lexeme is a stop word after stemming, while the Russian one decides before.
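How a caller might drive such a table can be sketched as follows. This is a guess at the calling convention, not the code in morph.c: the struct is a simplified copy of DICT (with `localename` as a pointer instead of a NAMEDATALEN array), `lemmatize` is assumed to return the stemmed word and update the length, and `process_word`, `toy_lemmatize`, `toy_is_stop`, `check_after` and `check_before` are hypothetical names. A NULL slot simply means that check is skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified copy of the DICT interface (assumption: pointer localename). */
typedef struct
{
    const char *localename;
    void       *(*init) (void);
    void        (*close) (void *);
    char       *(*lemmatize) (void *, char *, int *);
    int         (*is_stoplemm) (void *, char *, int);
    int         (*is_stemstoplemm) (void *, char *, int);
} DICT;

/* Toy lemmatizer (assumption): truncates "herring" to "her" in place. */
static char *toy_lemmatize(void *obj, char *word, int *len)
{
    (void) obj;
    if (*len == 7 && memcmp(word, "herring", 7) == 0)
    {
        word[3] = '\0';
        *len = 3;
    }
    return word;
}

/* Toy stop list (assumption): contains only "her". */
static int toy_is_stop(void *obj, char *word, int len)
{
    (void) obj;
    return len == 3 && memcmp(word, "her", 3) == 0;
}

/* Stop check after stemming, laid out like the porter_english table. */
const DICT check_after = { "C", NULL, NULL, toy_lemmatize, NULL, toy_is_stop };

/* Stop check before stemming, laid out like the russian_stemming table. */
const DICT check_before = { "C", NULL, NULL, toy_lemmatize, toy_is_stop, NULL };

/* Returns the lexeme to index, or NULL if the word is rejected as a stop word. */
char *process_word(const DICT *d, void *obj, char *word, int *len)
{
    char *lemma = word;

    assert(d != NULL && word != NULL && len != NULL);
    if (d->is_stoplemm && d->is_stoplemm(obj, word, *len))
        return NULL;            /* rejected before stemming */
    if (d->lemmatize)
        lemma = d->lemmatize(obj, word, len);
    if (d->is_stemstoplemm && d->is_stemstoplemm(obj, lemma, *len))
        return NULL;            /* rejected after stemming */
    return lemma;
}
```

With `check_after`, 'herring' is stemmed to 'her' and then dropped as a stop word; with `check_before`, it passes the stop check and is returned as the lexeme 'her'.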
--
Teodor Sigaev
teodor@stack.net