"Fuzzy" Matches on Nicknames

Started by Michael Sheaverover 9 years ago3 messagesgeneral
Jump to latest
#1Michael Sheaver
msheaver@me.com

Greetings,

I have two tables that are populated using large datasets from disparate external systems, and I am trying to match records by customer name between these two tables. I do not have any authoritative key, such as customerID or nationalID, by which I can match them up, and I have found many cases where the same customer has different first names in the two datasets. A sampling of the differences is as follows:

Michael <=> Mike
Tom <=> Thomas
Liz <=> Elizabeth
Margaret <=> Maggie

How can I build a query in PostgreSQL (v. 9.6) that will find possible matches like these on nicknames? My initial guess is that I would have to either find or build some sort of intermediary table that contains associated names like those above. Sometimes though, there will be more than matching pairs, like:

Jim <=> James <=> Jimmy <=> Jimmie
Bill <=> Will <=> Willie <=> William

and so forth.

Has anyone used or developed PostgreSQL queries that will find matches like these? I am running all my database queries. on my local laptops (Win7 and macOS), so performance or uptime is no issue here. I am curious to see how others in this community have creatively solved this common problem.

One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might work here, but honestly I am clueless as to how to set this up or use it in queries successfully.

Thanks,
Michael (aka Mike, aka Mikey)

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2rob stone
floriparob@gmail.com
In reply to: Michael Sheaver (#1)
Re: "Fuzzy" Matches on Nicknames

Hello Michael,
On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote:

Greetings,

I have two tables that are populated using large datasets from
disparate external systems, and I am trying to match records by
customer name between these two tables. I do not have any
authoritative key, such as customerID or nationalID, by which I can
match them up, and I have found many cases where the same customer
has different first names in the two datasets. A sampling of the
differences is as follows:

Michael <=> Mike
Tom <=> Thomas
Liz <=> Elizabeth
Margaret <=> Maggie

How can I build a query in PostgreSQL (v. 9.6) that will find
possible matches like these on nicknames? My initial guess is that I
would have to either find or build some sort of intermediary table
that contains associated names like those above. Sometimes though,
there will be more than matching pairs, like:

Jim <=> James <=> Jimmy <=> Jimmie
Bill <=> Will <=> Willie <=> William

and so forth.

Has anyone used or developed PostgreSQL queries that will find
matches like these? I am running all my database queries. on my local
laptops (Win7 and macOS), so performance or uptime is no issue here.
I am curious to see how others in this community have creatively
solved this common problem.

One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might
work here, but honestly I am clueless as to how to set this up or use
it in queries successfully.

Thanks,
Michael (aka Mike, aka Mikey)

Check out chapter F15 in the doco.
Try the double metaphone.
I worked on something similar many years ago cleaning up input created
by data entry clerks from hand written speeding tickets, so as to match
with "trusted" data held in a database.
As the volume of input was small in comparison with the number of
licensed drivers, we could iterate over and over again trying to match
it up.

HTH.
Rob

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Merlin Moncure
mmoncure@gmail.com
In reply to: rob stone (#2)
Re: "Fuzzy" Matches on Nicknames

On Tue, Nov 29, 2016 at 6:56 PM, rob stone <floriparob@gmail.com> wrote:

Hello Michael,
On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote:

Greetings,

I have two tables that are populated using large datasets from
disparate external systems, and I am trying to match records by
customer name between these two tables. I do not have any
authoritative key, such as customerID or nationalID, by which I can
match them up, and I have found many cases where the same customer
has different first names in the two datasets. A sampling of the
differences is as follows:

Michael <=> Mike
Tom <=> Thomas
Liz <=> Elizabeth
Margaret <=> Maggie

How can I build a query in PostgreSQL (v. 9.6) that will find
possible matches like these on nicknames? My initial guess is that I
would have to either find or build some sort of intermediary table
that contains associated names like those above. Sometimes though,
there will be more than matching pairs, like:

Jim <=> James <=> Jimmy <=> Jimmie
Bill <=> Will <=> Willie <=> William

and so forth.

Has anyone used or developed PostgreSQL queries that will find
matches like these? I am running all my database queries. on my local
laptops (Win7 and macOS), so performance or uptime is no issue here.
I am curious to see how others in this community have creatively
solved this common problem.

One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might
work here, but honestly I am clueless as to how to set this up or use
it in queries successfully.

Thanks,
Michael (aka Mike, aka Mikey)

Check out chapter F15 in the doco.
Try the double metaphone.
I worked on something similar many years ago cleaning up input created
by data entry clerks from hand written speeding tickets, so as to match
with "trusted" data held in a database.
As the volume of input was small in comparison with the number of
licensed drivers, we could iterate over and over again trying to match
it up.

Also check out pg_trgm extension. It's better for addresses than
names, but might be something to look at depending on how things turn
up with the data.

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general