Improving FTS for Greek

Started by Florents Tselai over 2 years ago, 4 messages
#1 Florents Tselai
florents.tselai@gmail.com

I posted earlier in pgsql-general that I realised there’s no greek.stop under $(pg_config --sharedir)/tsearch_data

And indeed, it looks like stop words are retained in the output of to_tsvector('greek', ..).

I wrote an extension, https://github.com/Florents-Tselai/pg_fts_greek, that adds another 'greek_ext' regconfig.
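For readers unfamiliar with how a stop word list gets attached to a text search configuration, here is a minimal sketch. It is not taken from the extension: the names greek_demo_stem and greek_demo are hypothetical, and it assumes a greek.stop file (one word per line) has already been placed under $(pg_config --sharedir)/tsearch_data.

```sql
-- Sketch only; greek_demo_stem and greek_demo are hypothetical names.
-- Assumes $(pg_config --sharedir)/tsearch_data/greek.stop exists.

-- A Snowball stemming dictionary for Greek that also drops stop words.
CREATE TEXT SEARCH DICTIONARY greek_demo_stem (
    TEMPLATE = snowball,
    Language = greek,
    StopWords = greek   -- resolves to tsearch_data/greek.stop
);

-- Start from the built-in configuration, then swap in the new dictionary
-- wherever the built-in greek_stem dictionary is currently mapped.
CREATE TEXT SEARCH CONFIGURATION greek_demo (COPY = greek);

ALTER TEXT SEARCH CONFIGURATION greek_demo
    ALTER MAPPING REPLACE greek_stem WITH greek_demo_stem;
```

Copying the built-in configuration keeps the parser and token mappings unchanged, so the only behavioural difference should be the stop word filtering.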

Here’s how the results compare

t: 'το τετράγωνο της υποτείνουσας ενός ορθογωνίου τριγώνου'
to_tsvector('greek', t): 'εν':5 'ορθογων':6 'τ':3 'τετραγων':2 'το':1 'τριγων':7 'υποτεινουσ':4
to_tsvector('greek_ext', t): 'εν':5 'ορθογων':6 'τετραγων':2 'τριγων':7 'υποτεινουσ':4

t: 'ο γιώργος είναι πονηρός'
to_tsvector('greek', t): 'γιωργ':2 'εινα':3 'ο':1 'πονηρ':4
to_tsvector('greek_ext', t): 'γιωργ':2 'πονηρ':4

t: 'ο ήλιος ο πράσινος o ήλιος που ανατέλλει'
to_tsvector('greek', t): 'o':5 'ανατελλ':8 'ηλι':2,6 'ο':1,3 'π':7 'πρασιν':4
to_tsvector('greek_ext', t): 'ανατελλ':8 'ηλι':2,6 'πρασιν':4
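A comparison like the one above can be reproduced with a query along these lines. This is a sketch that assumes the pg_fts_greek extension is installed so that the 'greek_ext' configuration exists; the column aliases and the samples alias are illustrative only.

```sql
-- Sketch: compare the built-in 'greek' configuration with 'greek_ext'.
-- Assumes the 'greek_ext' configuration from the extension is available.
SELECT t,
       to_tsvector('greek', t)     AS greek,
       to_tsvector('greek_ext', t) AS greek_ext
FROM (VALUES
    ('το τετράγωνο της υποτείνουσας ενός ορθογωνίου τριγώνου'),
    ('ο γιώργος είναι πονηρός'),
    ('ο ήλιος ο πράσινος o ήλιος που ανατέλλει')
) AS samples(t);
```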

There’s another previous relevant patch [0], but it was never merged. I’ve included these stop words and added some more (info in README.md).

For my personal projects, it looks like it yields much better results.

I’d like some feedback on the extension, particularly on the installation infra (I’m not sure I’ve handled the permissions properly in the .sql files).

I’ll then try to make a .patch for this.

[0]: /messages/by-id/e1c79330-48a5-abef-c309-8d4499e3180b@2ndquadrant.com

#2 Peter Eisentraut
peter@eisentraut.org
In reply to: Florents Tselai (#1)
Re: Improving FTS for Greek

On 03.06.23 19:47, Florents Tselai wrote:

There’s another previous relevant patch [0] but was never merged. I’ve
included these stop words and added some more (info in README.md).

For my personal projects looks like it yields much better results.

I’d like some feedback on the extension ; particularly on the
installation infra (I’m not sure I’ve handled properly the permissions
in the .sql files)

I’ll then try to make a .patch for this.

The open question at the previous attempt was that it wasn't clear what
the upstream source or long-term maintenance of the stop words list
would be. If it's just a personally composed list, then it's okay if
you use it yourself, but for including it into PostgreSQL it ought to
come from a reputable non-individual source like snowball.

#3 Florents Tselai
florents.tselai@gmail.com
In reply to: Peter Eisentraut (#2)
Re: Improving FTS for Greek

On 7 Jun 2023, at 12:13 AM, Peter Eisentraut <peter@eisentraut.org> wrote:

On 03.06.23 19:47, Florents Tselai wrote:

There’s another previous relevant patch [0] but was never merged. I’ve included these stop words and added some more (info in README.md).
For my personal projects looks like it yields much better results.
I’d like some feedback on the extension ; particularly on the installation infra (I’m not sure I’ve handled properly the permissions in the .sql files)
I’ll then try to make a .patch for this.

The open question at the previous attempt was that it wasn't clear what the upstream source or long-term maintenance of the stop words list would be. If it's just a personally composed list, then it's okay if you use it yourself, but for including it into PostgreSQL it ought to come from a reputable non-individual source like snowball.

I’ve used the NLTK list [0] as my base of stopwords. Wouldn’t this be considered reputable enough?

[0]: https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip (see the greek.stop file in the archive)

#4 Peter Eisentraut
peter@eisentraut.org
In reply to: Florents Tselai (#3)
Re: Improving FTS for Greek

On 07.06.23 00:30, Florents Tselai wrote:

On 7 Jun 2023, at 12:13 AM, Peter Eisentraut <peter@eisentraut.org> wrote:

On 03.06.23 19:47, Florents Tselai wrote:

There’s another previous relevant patch [0] but was never merged.
I’ve included these stop words and added some more (info in README.md).
For my personal projects looks like it yields much better results.
I’d like some feedback on the extension ; particularly on the
installation infra (I’m not sure I’ve handled properly the
permissions in the .sql files)
I’ll then try to make a .patch for this.

The open question at the previous attempt was that it wasn't clear
what the upstream source or long-term maintenance of the stop words
list would be.  If it's just a personally composed list, then it's
okay if you use it yourself, but for including it into PostgreSQL it
ought to come from a reputable non-individual source like snowball.

I’ve used the NLTK list [0] as my base of stopwords; Wouldn’t this be
considered reputable enough ?

[0]: https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip (see greek.stop file in the archive)

Who is NLTK, where did they get their stopwords file from, what is their
open source license, how do we know when to pull updates, what is the
mechanical process for pulling in those updates?