Find similar records (compare tsvectors)

Started by Patrick Dungover 11 years ago4 messagesgeneral
Jump to latest
#1Patrick Dung
patrick_dkt@yahoo.com.hk

Hello,
I had a database with articles or attachment stored in bytea format.I also had a trigger: it insert/update the tsv column when a record is added/updated.The tsv column had a GIN index.With this setting, I can do very fast keyword search on the tsv.
Suppose I had a specific record (id=100000).How to list similar records based on ranking?In that case, I had to compare a tsvector with another tsvector.
I had this SQL which make the original tsv as a text and then to tsquery, Then I can compare a tsv and a tsquery.
SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery) as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i, (SELECT tsv, id FROM items WHERE id=100000) AS original WHERE i.id != original.id ORDER BY similarity;
items table:id bigint
company varchar
industry varchardescription varcharpost_timestamp timestampattachment bytea
tsv tsvector

The problem is that this is very slow.Any comment?
Thank and regards,Patrick

#2Patrick Dung
patrick_dkt@yahoo.com.hk
In reply to: Patrick Dung (#1)
Re: Find similar records (compare tsvectors)

Resend.
How to quickly compare the similarity of two tsvector?

On Monday, March 2, 2015 11:01 PM, Patrick Dung <patrick_dkt@yahoo.com.hk> wrote:

Hello,
I had a database with articles or attachment stored in bytea format.I also had a trigger: it insert/update the tsv column when a record is added/updated.The tsv column had a GIN index.With this setting, I can do very fast keyword search on the tsv.
Suppose I had a specific record (id=100000).How to list similar records based on ranking?In that case, I had to compare a tsvector with another tsvector.
I had this SQL which make the original tsv as a text and then to tsquery, Then I can compare a tsv and a tsquery.
SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery) as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i, (SELECT tsv, id FROM items WHERE id=100000) AS original WHERE i.id != original.id ORDER BY similarity;
items table:id bigint
company varchar
industry varchardescription varcharpost_timestamp timestampattachment bytea
tsv tsvector

The problem is that this is very slow.Any comment?
Thank and regards,Patrick

#3Oleg Bartunov
oleg@sai.msu.su
In reply to: Patrick Dung (#2)
Re: Find similar records (compare tsvectors)

On Fri, Mar 6, 2015 at 5:05 PM, Patrick Dung <patrick_dkt@yahoo.com.hk> wrote:

Resend.

How to quickly compare the similarity of two tsvector?

check http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf

On Monday, March 2, 2015 11:01 PM, Patrick Dung <patrick_dkt@yahoo.com.hk>
wrote:

Hello,

I had a database with articles or attachment stored in bytea format.
I also had a trigger: it insert/update the tsv column when a record is
added/updated.
The tsv column had a GIN index.
With this setting, I can do very fast keyword search on the tsv.

Suppose I had a specific record (id=100000).
How to list similar records based on ranking?
In that case, I had to compare a tsvector with another tsvector.

I had this SQL which make the original tsv as a text and then to tsquery,
Then I can compare a tsv and a tsquery.
SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery)
as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i,
(SELECT tsv, id FROM items WHERE id=100000) AS original WHERE i.id !=
original.id ORDER BY similarity;

items table:
id bigint
company varchar
industry varchar
description varchar
post_timestamp timestamp
attachment bytea
tsv tsvector

The problem is that this is very slow.
Any comment?

Thank and regards,
Patrick

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Patrick Dung
patrick_dkt@yahoo.com.hk
In reply to: Oleg Bartunov (#3)
Re: Find similar records (compare tsvectors)

Thanks.
smlar is fast and quite good.I need find tuning on the search result.

On Saturday, March 7, 2015 12:07 AM, Oleg Bartunov <obartunov@gmail.com> wrote:

On Fri, Mar 6, 2015 at 5:05 PM, Patrick Dung <patrick_dkt@yahoo.com.hk> wrote:

Resend.

How to quickly compare the similarity of two tsvector?

check http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf

On Monday, March 2, 2015 11:01 PM, Patrick Dung <patrick_dkt@yahoo.com.hk>
wrote:

Hello,

I had a database with articles or attachment stored in bytea format.
I also had a trigger: it insert/update the tsv column when a record is
added/updated.
The tsv column had a GIN index.
With this setting, I can do very fast keyword search on the tsv.

Suppose I had a specific record (id=100000).
How to list similar records based on ranking?
In that case, I had to compare a tsvector with another tsvector.

I had this SQL which make the original tsv as a text and then to tsquery,
Then I can compare a tsv and a tsquery.
SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery)
as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i,
(SELECT tsv, id FROM items WHERE id=100000) AS original WHERE i.id !=
original.id ORDER BY similarity;

items table:
id bigint
company varchar
industry varchar
description varchar
post_timestamp timestamp
attachment bytea
tsv tsvector

The problem is that this is very slow.
Any comment?

Thank and regards,
Patrick

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general