how to group by similarity ?

Started by Andreasalmost 14 years ago4 messagesgeneral
Jump to latest
#1Andreas
maps.on@gmx.net

Hi,

I'm trying to get an idea about pg_trgrm.
I created a GIST index on a text column in a table.
Now I can filter the table with similarity().

How would I group the table so that it shows groups that have similarity
() > x ?

Lets say the table looks like this:

id, txt
1, aa1
2, bb1
3, cc1
4, bb2
5, bb3
6, aa2
...

How would a select look like that shows:

id, txt, group_id
1, aa1, 1,
6, aa2, 1,
2, bb1, 2,
4, bb2, 2,
5, bb3, 2,
3, cc1, 3

#2Harald Fuchs
hari.fuchs@gmail.com
In reply to: Andreas (#1)
Re: how to group by similarity ?

Andreas <maps.on@gmx.net> writes:

How would I group the table so that it shows groups that have
similarity () > x ?

Lets say the table looks like this:

id, txt
1, aa1
2, bb1
3, cc1
4, bb2
5, bb3
6, aa2
...

How would a select look like that shows:

id, txt, group_id
1, aa1, 1,
6, aa2, 1,
2, bb1, 2,
4, bb2, 2,
5, bb3, 2,
3, cc1, 3

I could only come up with this convoluted query:

WITH grp (t1, id, t2) AS (
SELECT t1.txt, t1.id, t2.txt
FROM tbl t1
LEFT JOIN tbl t2 ON t2.txt > t1.txt
WHERE t2.txt IS NULL OR similarity(t1.txt, t2.txt) > 0
)
SELECT t1, min(id)
FROM (
SELECT t1, id
FROM grp
UNION ALL
SELECT t2, id
FROM grp
WHERE t2 IS NOT NULL
) dummy
GROUP BY t1
ORDER BY t1

#3Harald Fuchs
hari.fuchs@gmail.com
In reply to: Andreas (#1)
Re: how to group by similarity ?

Andreas <maps.on@gmx.net> writes:

How would I group the table so that it shows groups that have
similarity () > x ?

Lets say the table looks like this:

id, txt
1, aa1
2, bb1
3, cc1
4, bb2
5, bb3
6, aa2
...

How would a select look like that shows:

id, txt, group_id
1, aa1, 1,
6, aa2, 1,
2, bb1, 2,
4, bb2, 2,
5, bb3, 2,
3, cc1, 3

The following query will do that, but it's convoluted:

WITH grp (t1, id, t2) AS (
SELECT t1.txt, t1.id, t2.txt
FROM tbt t1
LEFT JOIN tbt t2 ON t2.txt > t1.txt
WHERE t2.txt IS NULL OR similarity(t1.txt, t2.txt) > 0
)
SELECT t1, min(id)
FROM (
SELECT t1, id
FROM grp
UNION ALL
SELECT t2, id
FROM grp
WHERE t2 IS NOT NULL
) dummy
GROUP BY t1
ORDER BY t1

#4Tim Uckun
timuckun@gmail.com
In reply to: Harald Fuchs (#3)
Re: how to group by similarity ?

On Wed, Apr 25, 2012 at 8:34 PM, <hari.fuchs@gmail.com> wrote:

Andreas <maps.on@gmx.net> writes:

How would I group the table so that it shows groups that have
similarity () > x ?

Lets say the table looks like this:

id, txt
1, aa1
2, bb1
3, cc1
4, bb2
5, bb3
6, aa2
...

How would a select look like that shows:

id, txt, group_id
1, aa1, 1,
6, aa2, 1,
2, bb1, 2,
4, bb2, 2,
5, bb3, 2,
3, cc1, 3

Hey guys. I have a similar problem and I tried a couple of ways to
solve this including the window function described in the answer to
the original poster in this thread.

The problem I am having is that even with a trigam index and a table
with only 80,000 records the query takes forever to run. In both
cases I ended the query manually and have no idea how long it would
actually take to run. I have included the two queries below and am
hoping somebody can give me a pointer on how to accomplish with a
query that runs.

Query 1

WITH grp (t1, id, t2) AS (
SELECT t1.raw_data, t1.id, t2.raw_data
FROM schema.a t1
LEFT JOIN schema.a t2 ON t2.raw_data > t1.raw_data
WHERE t2.raw_data IS NULL OR similarity(t1.raw_data, t2.raw_data) > .75
)
SELECT t1, min(id)
FROM (
SELECT t1, id
FROM grp
UNION ALL
SELECT t2, id
FROM grp
WHERE t2 IS NOT NULL
) dummy
GROUP BY t1
ORDER BY t1

query 2

select similarity(a.raw_data,b.raw_data),*
from schema.a a, schema.a b where similarity(a.raw_data,b.raw_data) >
.75 and a.id != b.id