Merge rows based on Levenshtein distance

Started by mongooseover 11 years ago12 messagesgeneral

Jump to latest

mongoose

akarargyris@gmail.com

over 11 years ago

I am new to PostgreSQL and I have the following table:

Name, City
"Alex", "Washington"
"Aleex1", "Washington"
"Bob", "NYC"
"Booob", "NYC"

I want to "merge" similar rows based on levenshtein distance between names
so that I have the following table:

id, Name, City
1,"Alex", "Washington"
1,"Aleex1", "Washington"
2,"Bob", "NYC"
2,"Booob", "NYC"

How could I do that on PostgreSQL? Is there an SQL command for this?
Thnsls

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

David G. Johnston

david.g.johnston@gmail.com

over 11 years ago

In reply to: mongoose (#1)

Re: Merge rows based on Levenshtein distance

mongoose wrote

I am new to PostgreSQL and I have the following table:

Name, City
"Alex", "Washington"
"Aleex1", "Washington"
"Bob", "NYC"
"Booob", "NYC"

I want to "merge" similar rows based on levenshtein distance between names
so that I have the following table:

id, Name, City
1,"Alex", "Washington"
1,"Aleex1", "Washington"
2,"Bob", "NYC"
2,"Booob", "NYC"

How could I do that on PostgreSQL? Is there an SQL command for this?
Thnsls

So you have a table of N names and you want to evaluate (N-1)^2 pairs and
then use the output of the levenshtein calculation to group them together.

SELECT
l_names.name_value,
r_names.name_value, leven[...](l_names.name_value, r_names.name_value) AS
pair_group
FROM table_of_names AS l_names
CROSS JOIN table_of_names AS r_names
WHERE l_names.name_value <> r_names.name_value
;

Feel free to add "group by city" or "WHERE substring(l_names.name_value, 0,
1) = substring(r_names.name_value, 0, 1)" since it seems you need more than
just a name-distance to generate the desired groups. You'd likely want to
add the same "substring" call to the SELECT-list and "GROUP BY" clauses...

David J.

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5828847.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

mongoose

akarargyris@gmail.com

over 11 years ago

In reply to: David G. Johnston (#2)

Re: Merge rows based on Levenshtein distance

David,

Thank you for your prompt reply. I believe your answer helped a lot but it
seems I was not clear enough on my description. Basically I want a counter
(id) to show if two or more names are similar (i.e. levenshtein distance
less than 3) So in the previous example:

From this table:

Name, City
"Booob", "NYC"
"Alex", "Washington"
"Alexj2", "Washington"
"Bob", "NYC"
"Aleex1", "Washington"

to get this table:

id, Name, City
1,"Alex", "Washington"
1,"Aleex1", "Washington"
1,"Alexj2", "Washington"
2,"Bob", "NYC"
2,"Booob", "NYC"

So basically the id is a counter that starts from "1" and increments only
when there is a different name. Please notice that the table has its names
in a completely random order.

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829030.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

David G. Johnston

david.g.johnston@gmail.com

over 11 years ago

In reply to: mongoose (#3)

Re: Merge rows based on Levenshtein distance

On Tuesday, December 2, 2014, mongoose [via PostgreSQL] <
ml-node+s1045698n5829030h54@n5.nabble.com> wrote:

David,

Thank you for your prompt reply. I believe your answer helped a lot but it
seems I was not clear enough on my description. Basically I want a counter
(id) to show if two or more names are similar (i.e. levenshtein distance
less than 3) So in the previous example:

From this table:

Name, City
"Booob", "NYC"
"Alex", "Washington"
"Alexj2", "Washington"
"Bob", "NYC"
"Aleex1", "Washington"

to get this table:

id, Name, City
1,"Alex", "Washington"
1,"Aleex1", "Washington"
1,"Alexj2", "Washington"
2,"Bob", "NYC"
2,"Booob", "NYC"

So basically the id is a counter that starts from "1" and increments only
when there is a different name. Please notice that the table has its names
in a completely random order.

Write and combine a few subqueries that use window functions (namely lag
and row_number) to identify groups, label them, and assign rows to each
group (using a between condition on a join)

Pondering some (not tested) if you identify the boundary records in a
subquery you can assign them a value of 1 while all others take on null.
In the outer query you should be able to assign groups by simply
applying the sum function over the entire result such that at each boundary
value the presence of the 1 will increment the sum while the null rows will
use the sum value from the prior row.

David J.

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829041.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

pinker

pinker@onet.eu

over 11 years ago

In reply to: mongoose (#1)

Re: Merge rows based on Levenshtein distance

There is nice extension in postgres: fuzzystrmatch
<http://www.postgresql.org/docs/9.3/static/fuzzystrmatch.html> I have used
to calculate the distance. From documetation:

SELECT levenshtein_less_equal('extensive', 'exhaustive',2);

You can use it then with your group by query.

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829111.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

mongoose

akarargyris@gmail.com

over 11 years ago

In reply to: David G. Johnston (#4)

Re: Merge rows based on Levenshtein distance

David,

Thanks for the useful feedback. Since I am not an experienced developer it
is too complicated for me to come up with the queries. Besides I wonder if
this is going to be efficient to do this processing on PostgreSQL.

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829132.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

David G. Johnston

david.g.johnston@gmail.com

over 11 years ago

In reply to: mongoose (#6)

Re: Merge rows based on Levenshtein distance

I play with it when I get a chance but you should at least try to code
something.

Dave

On Wed, Dec 3, 2014 at 11:08 AM, mongoose [via PostgreSQL] <
ml-node+s1045698n5829132h16@n5.nabble.com> wrote:

David,

Thanks for the useful feedback. Since I am not an experienced developer it
is too complicated for me to come up with the queries. Besides I wonder if
this is going to be efficient to do this processing on PostgreSQL.

------------------------------
If you reply to this email, your message will be added to the discussion
below:

http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829132.html
To unsubscribe from Merge rows based on Levenshtein distance, click here
<http://postgresql.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=5828841&code=ZGF2aWQuZy5qb2huc3RvbkBnbWFpbC5jb218NTgyODg0MXwtMzI2NTA0MzIx>
.
NAML
<http://postgresql.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829137.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

David G. Johnston

david.g.johnston@gmail.com

over 11 years ago

In reply to: pinker (#5)

Re: Merge rows based on Levenshtein distance

On Wed, Dec 3, 2014 at 9:14 AM, pinker [via PostgreSQL] <
ml-node+s1045698n5829111h65@n5.nabble.com> wrote:

There is nice extension in postgres: fuzzystrmatch
<http://www.postgresql.org/docs/9.3/static/fuzzystrmatch.html> I have
used to calculate the distance. From documetation:

SELECT levenshtein_less_equal('extensive', 'exhaustive',2);

You can use it then with your group by query.

Something like this - replace the substring(...) comparison with
legenshtein_less_equal(...) or whatever comparison you find applicable.

In the case below new groups are started whenever the first letter of the
value changes.

The first group would be NULL so I add a COALESCE() call to make it 0 -
subsequent groups start with 1 and increment properly.

WITH src (val) AS (
VALUES ('A1'::varchar),('A2'),('B1'),('B2'),('B3'),('C1'),('D1')
)
, grp AS (
SELECT val
, CASE WHEN
substring(val,1,1) <> substring(lag(val) OVER (ORDER BY
val),1,1)
THEN 1
ELSE NULL
END AS changed
, ROW_NUMBER() OVER (ORDER BY val) AS val_idx
FROM src
)
SELECT val, COALESCE(sum(changed) OVER (ORDER BY val_idx), 0) AS group_id
FROM grp
;

David J.

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829143.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Michael Nolan

htfoot@gmail.com

over 11 years ago

In reply to: mongoose (#6)

Re: Merge rows based on Levenshtein distance

I don't think you've defined your problem very clearly.

Suppose you have 1000 names in your database. Are you planning to compare
each name to the other 999 names to see which is closest? What if two
names are equally close to a third name but not to each other, how do you
decide which is better?
--
Mike Nolan

#10

mongoose

akarargyris@gmail.com

over 11 years ago

In reply to: David G. Johnston (#8)

Re: Merge rows based on Levenshtein distance

Thanks for the help. I will give your code a try. Btw I know how to solve
this in a different language but unfortunately I am a very rookie with
databases.

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829145.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11

mongoose

akarargyris@gmail.com

over 11 years ago

In reply to: Michael Nolan (#9)

Re: Merge rows based on Levenshtein distance

Hi Mike,

I was planning to do something like David suggested: I would sort the rows
based on name and then I would use a window (i.e. 100 rows) to compare each
individual name to the previous 100. All I want to do is create groups of
similar rows based on some criteria.

--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829146.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12

Michael Nolan

htfoot@gmail.com

over 11 years ago

In reply to: mongoose (#11)

Re: Merge rows based on Levenshtein distance

Have you considered using a soundex function to sort names into similarity
groups? In my experience it works fairly well with Western European names,
not quite as well with names from other parts of the world. It also
doesn't deal well with many nicknames (Mike instead of Michael, etc.)

--
Mike Nolan