Grouping By Similarity (using pg_trgm)?
[pg version 9.3 or 9.4]
Suppose I have a simple table:
create table data (
my_value TEXT NOT NULL
);
CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);
Now I would like to essentially do group by to get a count of all the
values that are sufficiently similar. I can do it using something like a
CROSS JOIN to join the table on itself, but then I still am getting all the
rows with duplicate counts.
Is there a way to do a group by query and only return a single "my_value"
column and a count of the number of times other values are similar while
also not returning the included similar values in the output, too?
On Thu, May 14, 2015 at 11:58 AM, Cory Tucker <cory.tucker@gmail.com> wrote:
[pg version 9.3 or 9.4]
Suppose I have a simple table:
create table data (
my_value TEXT NOT NULL
);
CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);Now I would like to essentially do group by to get a count of all the
values that are sufficiently similar. I can do it using something like a
CROSS JOIN to join the table on itself, but then I still am getting all the
rows with duplicate counts.Is there a way to do a group by query and only return a single "my_value"
column and a count of the number of times other values are similar while
also not returning the included similar values in the output, too?
Concept below - not bothering to lookup the functions/operators for
pg_trgm:
SELECT my_value_src, count(*)
FROM (SELECT my_value AS my_value_src FROM data) src
JOIN (SELECT my_value AS my_value_compareto FROM data) comparedto
ON ( func(my_value_src, my_value_compareto) < # )
GROUP BY my_value_src
David J.
That produces pretty much the same results as the CROSS JOIN I was using
before. Because each "my_value" in the table are different, if I group on
just their value then I will always have the full result set and a bunch of
essentially duplicated results.
Any other ideas/options?
On Thu, May 14, 2015 at 12:08 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:
Show quoted text
On Thu, May 14, 2015 at 11:58 AM, Cory Tucker <cory.tucker@gmail.com>
wrote:[pg version 9.3 or 9.4]
Suppose I have a simple table:
create table data (
my_value TEXT NOT NULL
);
CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);Now I would like to essentially do group by to get a count of all the
values that are sufficiently similar. I can do it using something like a
CROSS JOIN to join the table on itself, but then I still am getting all the
rows with duplicate counts.Is there a way to do a group by query and only return a single "my_value"
column and a count of the number of times other values are similar while
also not returning the included similar values in the output, too?Concept below - not bothering to lookup the functions/operators for
pg_trgm:SELECT my_value_src, count(*)
FROM (SELECT my_value AS my_value_src FROM data) src
JOIN (SELECT my_value AS my_value_compareto FROM data) comparedto
ON ( func(my_value_src, my_value_compareto) < # )
GROUP BY my_value_srcDavid J.
On Thu, May 14, 2015 at 1:09 PM, Cory Tucker <cory.tucker@gmail.com> wrote:
That produces pretty much the same results as the CROSS JOIN I was using
before. Because each "my_value" in the table are different, if I group on
just their value then I will always have the full result set and a bunch of
essentially duplicated results.Any other ideas/options?
how do you want to solve the problem:
A is similar to B
B is similar to C
A IS NOT similar to C
?
I'm not sure that PostgreSQL is the best tool to solve clustering
problems...though my experience with them is minimal.
David J.
Have you seen http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf ?
On Thu, May 14, 2015 at 9:58 PM, Cory Tucker <cory.tucker@gmail.com> wrote:
Show quoted text
[pg version 9.3 or 9.4]
Suppose I have a simple table:
create table data (
my_value TEXT NOT NULL
);
CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);Now I would like to essentially do group by to get a count of all the
values that are sufficiently similar. I can do it using something like a
CROSS JOIN to join the table on itself, but then I still am getting all the
rows with duplicate counts.Is there a way to do a group by query and only return a single "my_value"
column and a count of the number of times other values are similar while
also not returning the included similar values in the output, too?