Full Text Search combined with Fuzzy

Started by Nicolas Paris, about 9 years ago, 6 messages, general
#1Nicolas Paris
niparisco@gmail.com

Hello,

AFAIK there is no built-in way to combine full text search and fuzzy matching
(https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
For example, searching for a phrase that contains typos.

First, I don't know whether PostgreSQL's competitors (Lucene-based engines...) are able
to do so.

Second, is such a feature on the roadmap?

Third, I wonder if it is a good idea to use the PostgreSQL synonym
feature for such a purpose (https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html).
I mean, building up a synonym dictionary containing typos. E.g.:

postgres pgsql
postgresql pgsql
postgrez pgsql
postgre pgsql
gogle googl
gooogle googl

There are multiple ways to build such a dictionary. But my question is about
the implementation of dictionaries in PostgreSQL: is PostgreSQL
able to take advantage of billion-entry dictionaries?

Thanks in advance for your answers,

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Oleg Bartunov
oleg@sai.msu.su
In reply to: Nicolas Paris (#1)
Re: Full Text Search combined with Fuzzy

On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris <niparisco@gmail.com> wrote:

Hello,

AFAIK there is no built-in way to combine full text search and fuzzy
matching
(https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
For example, searching for a phrase that contains typos.

First, I don't know whether PostgreSQL's competitors (Lucene-based engines...) are able
to do so.

Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used
for this.
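
For a sense of what pg_trgm matches on, here is a minimal illustration (a generic sketch, not from this thread; functions and operators as documented for the extension):

```sql
-- pg_trgm splits strings into three-character groups ("trigrams") and
-- scores two strings by the fraction of trigrams they share.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT show_trgm('word');            -- the trigram set extracted from 'word'
SELECT similarity('word', 'world');  -- shared-trigram score in [0, 1]
SELECT 'word' % 'world';             -- true when the score exceeds the threshold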

Second, is such a feature on the roadmap?

Third, I wonder if it is a good idea to use the PostgreSQL synonym
feature for such a purpose (https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html).
I mean, building up a synonym dictionary containing typos. E.g.:

postgres pgsql
postgresql pgsql
postgrez pgsql
postgre pgsql
gogle googl
gooogle googl

There are multiple ways to build such a dictionary. But my question is about
the implementation of dictionaries in PostgreSQL: is PostgreSQL
able to take advantage of billion-entry dictionaries?

A dictionary is just a program, so it's up to the developer to write an
efficient program to deal with billions of entries. The synonym dictionary
specifically is not intended to work with a lot of entries. BTW, have a
look at the contrib/dict_xsyn dictionary, which is more flexible than synonym.
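
For reference, wiring up dict_xsyn could look roughly like this; the rules file name and its contents are hypothetical, while the template and options are the ones documented for contrib/dict_xsyn:

```sql
-- Hypothetical $SHAREDIR/tsearch_data/typos.rules, one group per line,
-- canonical word first, then its variants:
--   pgsql postgres postgresql postgrez postgre
CREATE TEXT SEARCH DICTIONARY typo_xsyn (
    TEMPLATE = xsyn_template,
    RULES = typos,           -- loads typos.rules from $SHAREDIR/tsearch_data
    MATCHSYNONYMS = true     -- match the variants, not only the first word
);
```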



#3Nicolas Paris
niparisco@gmail.com
In reply to: Oleg Bartunov (#2)
Re: Full Text Search combined with Fuzzy

On 27 Feb 2017 at 10:32, Oleg Bartunov wrote:

On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris <niparisco@gmail.com> wrote:

Hello,

AFAIK there is no built-in way to combine full text search and fuzzy
matching
(https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
For example, searching for a phrase that contains typos.

First, I don't know whether PostgreSQL's competitors (Lucene-based engines...) are able
to do so.

Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used for
this.

Hi Oleg,

Thanks. I thought pg_trgm was not able to index my long texts because of
the 8191-byte-per-index-row limitation of btree.

Then I found out it is possible to use pg_trgm with a GIN/GiST index.
My final use case is phrase mining in texts.

I want my application to return texts that approximately contain the user's
entry:

E.g., the user searches for "Hello Word";
a text containing "blah blah blah hello world blah blah blah" would be
returned.

Test:
postgres=# CREATE table test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah blah blah');
INSERT 0 1
postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah blah blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                   texts                   | similarity
-------------------------------------------+------------
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  |     0.6875
(2 rows)

postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                                     QUERY PLAN
-----------------------------------------------------------------------------------
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 width=0)
         Index Cond: (texts % 'hello word'::text)
(4 rows)

Conclusion: if I say 0.4 is my threshold, would this methodology meet
my requirements?

Thanks for the help !
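
For what it's worth, the cutoff that % applies is configurable, so a 0.4 policy can be enforced by the operator itself; a sketch assuming the pg_trgm GUC available in 9.6 (older releases use set_limit()):

```sql
-- % returns true when similarity() reaches pg_trgm.similarity_threshold
-- (default 0.3):
SET pg_trgm.similarity_threshold = 0.4;
SELECT texts FROM test_trgm WHERE texts % 'hello word';
```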


#4Arthur Zakirov
a.zakirov@postgrespro.ru
In reply to: Nicolas Paris (#3)
Re: Full Text Search combined with Fuzzy

On 03.03.2017 15:49, Nicolas Paris wrote:


Hello,

If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For
example:

postgres=# SELECT texts, word_similarity('hello word', texts) FROM
test_trgm WHERE 'hello word' <% texts;
                   texts                   | word_similarity
-------------------------------------------+-----------------
 blah blah blah hello world blah blah blah |        0.818182
 blah blah blah hello word blah blah blah  |               1
(2 rows)

1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html
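
The <% operator has its own cutoff; a sketch using the GUC introduced alongside word_similarity() in 9.6:

```sql
-- <% returns true when word_similarity() reaches
-- pg_trgm.word_similarity_threshold (default 0.6):
SET pg_trgm.word_similarity_threshold = 0.8;
SELECT texts FROM test_trgm WHERE 'hello word' <% texts;
```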

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


#5Nicolas Paris
niparisco@gmail.com
In reply to: Arthur Zakirov (#4)
Re: Full Text Search combined with Fuzzy

On 3 March 2017 at 14:08, Artur Zakirov wrote:


Nice! I do have version 9.6.

Would this kind of index be able to handle more than 20M large texts? The
recheck condition looks resource-consuming.

Is the full text index + phrase search + synonym dictionary the only
other alternative for typo-tolerant phrase mining?

Is there any possibility of adding typo support to the full text
search roadmap in the future?

Thanks,


#6Arthur Zakirov
a.zakirov@postgrespro.ru
In reply to: Nicolas Paris (#5)
Re: Full Text Search combined with Fuzzy

On 03.03.2017 16:17, Nicolas Paris wrote:

Nice! I do have version 9.6.

Would this kind of index be able to handle more than 20M large texts? The
recheck condition looks resource-consuming.

You are right. I think pg_trgm will not be good for such large texts,
unfortunately.

Is the full text index + phrase search + synonym dictionary the only
other alternative for typo-tolerant phrase mining?

I suppose there are no other options now. Though, prefix search may
help you [1].

Is there any possibility of adding typo support to the full text
search roadmap in the future?

As far as I know, there are no plans in the near future to add similarity
full text search.

1.
https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html
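
A generic prefix-search example (not from the thread); it tolerates truncated words, but not typos elsewhere in a word:

```sql
-- ':*' marks a prefix: the query matches any lexeme starting with 'postgre'.
SELECT to_tsvector('simple', 'postgresql rocks')
       @@ to_tsquery('simple', 'postgre:*');
```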

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
