Snowball and ispell in tsearch2
We have received a lot of requests to include stemmers and ispell dictionaries
for all available languages in tsearch2. I understand that this would bring
tsearch2 closer to the end user. But the sources of the snowball stemmers are
about 800 kB, and each ispell dictionary will take about 0.5-2 MB. All sizes are
given with compression. I am afraid that is too big...
What are your opinions?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Hello Teodor,
I've just recently implemented an advanced full-text search function on
top of tsearch2. Searching through the manuals and websites to get the
snowball stemmer and compile my own module took me way too long. I'd
rather go fetch a cup of coffee during a 30 minute download...
That said, I don't necessarily mean that all stemmers must be included
in CVS or such. It should just be simpler for the database administrator
to install ispell or stemmer 'modules'. A non-plus-ultra solution would
be to provide packages for each language (in debian or fedora, etc.).
Perhaps we can put together the source code for all language modules
available and provide scripts to fetch ispell data or to generate the
snowball stemmers. A debian package maintainer would have to fetch all
the data to generate all language packages. Someone else might just want
to download and compile a norwegian snowball stemmer.
I'd be willing to help with such a project. I have experience with
tsearch2 as well as with gentoo and debian packaging. I can't help with
rpm, though.
Regards
Markus
800kb, each ispell dictionaries will takes about 0.5-2M. All sizes are
Sorry, withOUT compression...
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
OpenFTS ebuild: http://bugs.gentoo.org/show_bug.cgi?id=135859
It has a USE flag for the snowball stemmer. I can take care of
packaging for Gentoo if it will free up time for you to work on other
distros.
John
PS: upstream package size isn't, and shouldn't be, an issue; it should
be left to the packaging systems to discretely fetch what is needed.
On 6/7/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
That said, I don't necessarily mean that all stemmers must be included
in CVS or such. It should just be simpler for the database administrator
to install ispell or stemmer 'modules'. A non-plus-ultra solution would
be to provide packages for each language (in debian or fedora, etc..).
I'd be willing to help with such a project. I have experience with
tsearch2 as well as with gentoo and debian packaging. I can't help with
rpm, though.
Regards
Markus
We got a lot requests about including stemmers and ispell dictionaries
for all accessible languages into tsearch2. I understand that tsearch2
will be closer to end user. But sources of snowball stemmers is about
800kb, each ispell dictionaries will takes about 0.5-2M. All sizes are
sized with compression. I am afraid that is too big size...
What are opinions?
Maybe putting it on pgFoundry?
Perhaps we can put together the source code for all languages modules
available and provide scripts to fetch ispell data or to generate the
snowball stemmers. A debian package maintainer would have to fetch all
the data to generate all language packages. Someone else might just want
to download and compile a norwegian snowball stemmer.
I'd be willing to help with such a project. I have experience with
tsearch2 as well as with gentoo and debian packaging. I can't help with
rpm, though.
I could help with a FreeBSD package I suppose.
I'd be willing to help with such a project. I have experience with
tsearch2 as well as with gentoo and debian packaging. I can't help
with rpm, though.
I could help with a FreeBSD package I suppose.
Although I should probably finish up those damn GIN docs first :)
Maybe putting it on pgFoundry?
Hmm, that's an option. We can create a project 'tsearch2_dict', and there I'll place
a contrib module which will build all Snowball stemmers. Right now I'm working on
supporting OpenOffice's dictionaries in tsearch2, so it will be simple to add that
to a packaging system.
I suggest that somebody manage the packages / package builders for the different
packaging systems in the same CVS (sorry, I haven't any experience with those systems).
BTW, it would be good if packaging worked with an already built ("maked") postgres, something like:
% cd PGSQL/contrib/tsearch2
% make LANG=norwegian
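
A packaging script could wrap that invocation in a small helper, roughly as sketched below. This is only an illustration: the function name stemmer_build_cmd, the SUPPORTED_LANGS list, and the LANG= make variable are assumptions drawn from the example above, not an existing tsearch2 build interface. The helper only prints the command it would run (a dry run), so it can be tried without a PostgreSQL source tree.

```shell
#!/bin/sh
# Hypothetical helper: print the make invocation for one stemmer language.
# The language list below is an assumed example, not the real set of
# Snowball stemmers shipped anywhere.
SUPPORTED_LANGS="english norwegian russian"

stemmer_build_cmd() {
    lang=$1
    for known in $SUPPORTED_LANGS; do
        if [ "$known" = "$lang" ]; then
            # Dry run: only print the command, do not execute it.
            echo "make LANG=$lang"
            return 0
        fi
    done
    # Unknown language: report on stderr and signal failure to the caller.
    echo "unknown language: $lang" >&2
    return 1
}
```

Calling stemmer_build_cmd norwegian prints the make command for that language; an unlisted language returns a non-zero status, so a package builder can stop early.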
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
I'll place contrib module which will make all Snowball stemmers. Right
now I'm working on supporting OpenOffice's dictionaries in tsearch2, so
it will be simple to add it to packaging system.
done, http://archives.postgresql.org/pgsql-committers/2006-06/msg00112.php
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/