Similarity search with the tsearch2 extension

Started by Janek Sendrowskiover 12 years ago3 messagesgeneral
Jump to latest
#1Janek Sendrowski
janek12@web.de

Hi,
 
I want to realize a Full Text Search with the tsearch2 extension. It should find similar sentences.
 
I used my own trigger to store the tsvector of the sentences and I created a usual gist index on them.
I have to to use many OR statements with a low set of arguments, what heavy damages the performance.
My former query looked like this:
 
SELECT strip(to_tsvector('The tiger is the largest cat species, reaching a total body length of up to 3.3 m and weighing up to 306 kg.'));
strip
----------------------------------------------------------------------------------------------
'3.3' '306' 'bodi' 'cat' 'kg' 'largest' 'length' 'm' 'reach' 'speci' 'tiger' 'total' 'weigh'
(1 row)
 
SELECT * FROM tablename WHERE vector @@ to_tsquery('speci & tiger & total & weigh') AND vector @@ to_tsquery('largest & length & m & reach')  AND vector @@ to_tsquery('3.3 & 306 & bodi & cat & kg');

And thats very slow.
Is there a better solution like a functional index?

Thank you for your help.
 
Janek Sendrowski

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

#2Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Janek Sendrowski (#1)
Re: Similarity search with the tsearch2 extension

Janek Sendrowski <janek12@web.de> wrote:

I want to realize a Full Text Search with the tsearch2 extension.
It should find similar sentences.
 
I used my own trigger to store the tsvector of the sentences and
I created a usual gist index on them.
I have to to use many OR statements with a low set of arguments,
what heavy damages the performance.
My former query looked like this:

SELECT * FROM tablename
   WHERE vector @@ to_tsquery('speci & tiger & total & weigh')
     AND vector @@ to_tsquery('largest & length & m & reach')
     AND vector @@ to_tsquery('3.3 & 306 & bodi & cat & kg');

I don't see any OR operators there.

And thats very slow.

Are you sure it is using the index?

Anyway, it is better to show an example, with EXPLAIN ANALYZE
output.  Here's mine, involving searches of War and Peace.

test=# -- Create the table.
test=# -- In reality, I would probably make tsv NOT NULL,
test=# -- but I'm keeping the example simple...
test=# CREATE TABLE war_and_peace
test-#   (
test(#     lineno serial PRIMARY KEY,
test(#     linetext text NOT NULL,
test(#     tsv tsvector
test(#   );
CREATE TABLE
test=#
test=# -- Load from downloaded data into database.
test=# COPY war_and_peace (linetext)
test-#   FROM '/home/kgrittn/Downloads/war-and-peace.txt';
COPY 65007
test=#
test=# -- "Digest" data to lexemes.
test=# UPDATE war_and_peace
test-#   SET tsv = to_tsvector('english', linetext);
UPDATE 65007
test=#
test=# -- Index the lexemes using GIN.
test=# CREATE INDEX war_and_peace_tsv
test-#   ON war_and_peace
test-#   USING gin (tsv);
CREATE INDEX
test=#
test=# -- Make sure the database has statistics.
test=# VACUUM ANALYZE war_and_peace;
VACUUM
test=#
test=# -- Find lines with "gentlemen".
test=# EXPLAIN ANALYZE
test-# SELECT * FROM war_and_peace
test-#   WHERE tsv @@ to_tsquery('english', 'gentlemen');
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on war_and_peace  (cost=12.52..240.18 rows=67 width=115) (actual time=0.058..0.130 rows=84 loops=1)
   Recheck Cond: (tsv @@ '''gentlemen'''::tsquery)
   ->  Bitmap Index Scan on war_and_peace_tsv  (cost=0.00..12.50 rows=67 width=0) (actual time=0.045..0.045 rows=84 loops=1)
         Index Cond: (tsv @@ '''gentlemen'''::tsquery)
 Total runtime: 0.160 ms
(5 rows)

test=#
test=# -- Find lines with "ladies".
test=# EXPLAIN ANALYZE
test-# SELECT * FROM war_and_peace
test-#   WHERE tsv @@ to_tsquery('english', 'ladies');
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on war_and_peace  (cost=13.39..547.24 rows=180 width=115) (actual time=0.062..0.215 rows=184 loops=1)
   Recheck Cond: (tsv @@ '''ladi'''::tsquery)
   ->  Bitmap Index Scan on war_and_peace_tsv  (cost=0.00..13.35 rows=180 width=0) (actual time=0.043..0.043 rows=184 loops=1)
         Index Cond: (tsv @@ '''ladi'''::tsquery)
 Total runtime: 0.247 ms
(5 rows)

test=#
test=# -- Find lines with "ladies" and "gentlemen".
test=# EXPLAIN ANALYZE
test-# SELECT * FROM war_and_peace
test-#   WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen');
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on war_and_peace  (cost=20.00..24.01 rows=1 width=115) (actual time=0.062..0.063 rows=1 loops=1)
   Recheck Cond: (tsv @@ '''ladi'' & ''gentlemen'''::tsquery)
   ->  Bitmap Index Scan on war_and_peace_tsv  (cost=0.00..20.00 rows=1 width=0) (actual time=0.057..0.057 rows=1 loops=1)
         Index Cond: (tsv @@ '''ladi'' & ''gentlemen'''::tsquery)
 Total runtime: 0.090 ms
(5 rows)

test=#
test=# -- Find lines with ("ladies" and "gentlemen") and ("provinces" and "distance").
test=# EXPLAIN ANALYZE
test-# SELECT * FROM war_and_peace
test-#   WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen')
test-#     AND tsv @@ to_tsquery('english', 'provinces & distance');
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on war_and_peace  (cost=36.00..40.02 rows=1 width=115) (actual time=0.100..0.100 rows=1 loops=1)
   Recheck Cond: ((tsv @@ '''ladi'' & ''gentlemen'''::tsquery) AND (tsv @@ '''provinc'' & ''distanc'''::tsquery))
   ->  Bitmap Index Scan on war_and_peace_tsv  (cost=0.00..36.00 rows=1 width=0) (actual time=0.095..0.095 rows=1 loops=1)
         Index Cond: ((tsv @@ '''ladi'' & ''gentlemen'''::tsquery) AND (tsv @@ '''provinc'' & ''distanc'''::tsquery))
 Total runtime: 0.130 ms
(5 rows)

test=#
test=# -- Find lines with ("ladies" or "gentlemen") and ("provinces" or "distance").
test=# EXPLAIN ANALYZE
test-# SELECT * FROM war_and_peace
test-#   WHERE tsv @@ to_tsquery('english', 'ladies | gentlemen')
test-#     AND tsv @@ to_tsquery('english', 'provinces | distance');
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on war_and_peace  (cost=36.00..40.02 rows=1 width=115) (actual time=0.043..0.043 rows=1 loops=1)
   Recheck Cond: ((tsv @@ '''ladi'' | ''gentlemen'''::tsquery) AND (tsv @@ '''provinc'' | ''distanc'''::tsquery))
   ->  Bitmap Index Scan on war_and_peace_tsv  (cost=0.00..36.00 rows=1 width=0) (actual time=0.042..0.042 rows=1 loops=1)
         Index Cond: ((tsv @@ '''ladi'' | ''gentlemen'''::tsquery) AND (tsv @@ '''provinc'' | ''distanc'''::tsquery))
 Total runtime: 0.056 ms
(5 rows)

test=#
test=# -- Find lines with ("ladies" and "gentlemen") or ("provinces" and "distance").
test=# EXPLAIN ANALYZE
test-# SELECT * FROM war_and_peace
test-#   WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen')
test-#      OR tsv @@ to_tsquery('english', 'provinces & distance');
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on war_and_peace  (cost=40.00..44.02 rows=1 width=115) (actual time=0.080..0.080 rows=1 loops=1)
   Recheck Cond: ((tsv @@ '''ladi'' & ''gentlemen'''::tsquery) OR (tsv @@ '''provinc'' & ''distanc'''::tsquery))
   ->  BitmapOr  (cost=40.00..40.00 rows=1 width=0) (actual time=0.076..0.076 rows=0 loops=1)
         ->  Bitmap Index Scan on war_and_peace_tsv  (cost=0.00..20.00 rows=1 width=0) (actual time=0.052..0.052 rows=1 loops=1)
               Index Cond: (tsv @@ '''ladi'' & ''gentlemen'''::tsquery)
         ->  Bitmap Index Scan on war_and_peace_tsv  (cost=0.00..20.00 rows=1 width=0) (actual time=0.024..0.024 rows=1 loops=1)
               Index Cond: (tsv @@ '''provinc'' & ''distanc'''::tsquery)
 Total runtime: 0.116 ms
(8 rows)

Can you provide a similar example which slows the slowness you report?
                                                                                                                                                                                                                                                                               
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

#3Janek Sendrowski
janek12@web.de
In reply to: Kevin Grittner (#2)
Re: [PERFORM] Similarity search with the tsearch2 extension

Sorry, I used AND-statements instead of OR-statement in the example.
I notices that gin is much faster than gist, but I don't know why.

The query gets slow, because there are many non-stop words which appear very often in my sentences, like in 3% of all the sentences.
Do you think it could be worth it to filter the words, which appears that often and declare them as stop-words.
How would you split a sentence with let's say 10 non stop words to provide a performed similarity search?
 
There's still the problem with very short sentences. An partiel index on them with the trigram search might be the solution.
The pg_trgm module is far to slow for bigger setences, like you showed.
 
I thought I'll build a few partiel indexes on the string length, to enhance the performance.
Do you know some more improvements?
 
Janek Sendrowki

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general