Postgresql GROUP BY "SIMILAR" but not equal values

Started by alexandros_eover 12 years ago6 messagesgeneral

Jump to latest

alexandros_e

alexandros.ef@gmail.com

over 12 years ago

I wanted to ask you the following question to all experts here.

Let's say I have this table foo

ID|G1|T1|
1|2|ABC|
1|2|ABCD|
1|2|DEF|
1|2|DEFG|

SELECT * FROM foo
GROUP BY ID,G1,T1

RETURNS exactly the same table.

Is there a way in SQL or PostgreSQL in general to group by values than are
not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on
some distance function (levenshtein for example) if the distance is within
some threshold (i.e., 1)

My intuition is that SQL cannot support such queries but I was wondering if
there was some hack around it. The problem as I see it that distance
functions require 2 values but GROUP BY only checks equality.

Another subproblem that might help is can we overload an operator of a
custom type, so that equals operator is more "relaxed" and is calculated by
a function? OR can we use GROUP BY with a custom comparator for a data
type?

I hope my question makes sense.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Postgresql-GROUP-BY-SIMILAR-but-not-equal-values-tp5790860.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Tom Lane

tgl@sss.pgh.pa.us

over 12 years ago

In reply to: alexandros_e (#1)

Re: Postgresql GROUP BY "SIMILAR" but not equal values

alexandros_e <alexandros.ef@gmail.com> writes:

Is there a way in SQL or PostgreSQL in general to group by values than are
not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on
some distance function (levenshtein for example) if the distance is within
some threshold (i.e., 1)

Well, you can GROUP BY the result of a function.

You are going to have to think harder than the above in any case.
For example, it's not hard to imagine a "similarity" operator that
says that A is similar to B, and B is similar to C, but if you ask
it to compare A to C it says they're not similar (enough). Now what?
Are A,B,C all part of the same group? If you take the transitive
closure of such an operator you probably end up with everything in
one group; but if you don't, it's hard to see a principled result
at all.

If you can cast your problem as transformation of the values into
some canonical or representative form, then you can do that and then
group on simple equality of the canonical values. For instance
case-insensitive grouping is customarily done with

GROUP BY lower(x)

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Alban Hertroys

haramrae@gmail.com

over 12 years ago

In reply to: alexandros_e (#1)

Re: Postgresql GROUP BY "SIMILAR" but not equal values

On 6 February 2014 16:18, alexandros_e <alexandros.ef@gmail.com> wrote:

Let's say I have this table foo

ID|G1|T1|
1|2|ABC|
1|2|ABCD|
1|2|DEF|
1|2|DEFG|

SELECT * FROM foo
GROUP BY ID,G1,T1

Is there a way in SQL or PostgreSQL in general to group by values than are
not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on
some distance function (levenshtein for example) if the distance is within
some threshold (i.e., 1)

Perhaps there is: You can calculate the levenshtein distance between
those values using a self-join and then GROUP BY the result of that
expression and limit the results with HAVING.

For example:
SELECT foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1)
FROM foo foo1
INNER JOIN foo foo2 ON (foo2.ID = foo1.ID AND foo2.G1 = foo1.G1)
GROUP BY foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1)
HAVING levenshtein(foo1.T1, foo2.T1) > 1

Is that what you're looking for?

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Sergey Konoplev

gray.ru@gmail.com

over 12 years ago

In reply to: Alban Hertroys (#3)

Re: Postgresql GROUP BY "SIMILAR" but not equal values

On Thu, Feb 6, 2014 at 7:41 AM, Alban Hertroys <haramrae@gmail.com> wrote:

On 6 February 2014 16:18, alexandros_e <alexandros.ef@gmail.com> wrote:

Is there a way in SQL or PostgreSQL in general to group by values than are
not exactly the same but are quite similar (like 'ABC' and 'ABCD') based on
some distance function (levenshtein for example) if the distance is within
some threshold (i.e., 1)

Perhaps there is: You can calculate the levenshtein distance between
those values using a self-join and then GROUP BY the result of that
expression and limit the results with HAVING.

For example:
SELECT foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1)
FROM foo foo1
INNER JOIN foo foo2 ON (foo2.ID = foo1.ID AND foo2.G1 = foo1.G1)
GROUP BY foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1)
HAVING levenshtein(foo1.T1, foo2.T1) > 1

From my understanding of the question, probably, adding another
levenshtein distance to some base value will make more sense:

GROUP BY levenshtein(foo1.T1, foo2.T1), levenshtein(foo1.T1, 'A')

Though, it looks like a clusterization task for me, and therefore I
would recommend OP to look at the PL/R
http://www.joeconway.com/plr/doc/index.html.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Gauthier, Dave

dave.gauthier@intel.com

over 12 years ago

In reply to: Tom Lane (#2)

Re: Postgresql GROUP BY "SIMILAR" but not equal values

What about a regexp match ?

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Thursday, February 06, 2014 10:32 AM
To: alexandros_e
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Postgresql GROUP BY "SIMILAR" but not equal values

alexandros_e <alexandros.ef@gmail.com> writes:

Is there a way in SQL or PostgreSQL in general to group by values than
are not exactly the same but are quite similar (like 'ABC' and 'ABCD')
based on some distance function (levenshtein for example) if the
distance is within some threshold (i.e., 1)

Well, you can GROUP BY the result of a function.

You are going to have to think harder than the above in any case.
For example, it's not hard to imagine a "similarity" operator that says that A is similar to B, and B is similar to C, but if you ask it to compare A to C it says they're not similar (enough). Now what?
Are A,B,C all part of the same group? If you take the transitive closure of such an operator you probably end up with everything in one group; but if you don't, it's hard to see a principled result at all.

If you can cast your problem as transformation of the values into some canonical or representative form, then you can do that and then group on simple equality of the canonical values. For instance case-insensitive grouping is customarily done with

GROUP BY lower(x)

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

alexandros_e

alexandros.ef@gmail.com

over 12 years ago

In reply to: Alban Hertroys (#3)

Re: Postgresql GROUP BY "SIMILAR" but not equal values

Conceptually, Tom (as always) is right. But Alban's hack help.

DROP TABLE foo;
CREATE TABLE IF NOT EXISTS foo(ID INTEGER,G1 INTEGER, T1 TEXT, ID2 SERIAL
PRIMARY KEY);

INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABC');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABCD');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'ABDC');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'DEF');
INSERT INTO foo(ID,G1,T1) VALUES(1,2,'DEFH');

/* A little editing to remove duplicates a to b and b to a */
SELECT foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1),foo2.T1
FROM foo foo1
INNER JOIN foo foo2 ON (foo2.ID = foo1.ID AND foo2.G1 = foo1.G1)
WHERE foo1.ID2<foo2.ID2
GROUP BY foo1.ID, foo1.G1, foo1.T1, levenshtein(foo1.T1, foo2.T1),foo2.T1
HAVING levenshtein(foo1.T1, foo2.T1) <2;

RETURNS

ID|G1|foo1.T1|foo2.T1
1;2;"ABC";1;"ABCD"
1;2;"ABC";1;"ABDC"
1;2;"ABCD";2;"ABDC"
1;2;"DEF";1;"DEFH"

Then it requires a second grouping but as Tom suggested it would be hard to
somehow group all similar cases together because then it becomes a
clustering problem. With a second grouping we will have 3 records instead of
4, so it is better than the initial case by 25%.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Postgresql-GROUP-BY-SIMILAR-but-not-equal-values-tp5790860p5790876.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general