"Fuzzy" Matches on Nicknames
Greetings,
I have two tables that are populated using large datasets from disparate external systems, and I am trying to match records by customer name between these two tables. I do not have any authoritative key, such as customerID or nationalID, by which I can match them up, and I have found many cases where the same customer has different first names in the two datasets. A sampling of the differences is as follows:
Michael <=> Mike
Tom <=> Thomas
Liz <=> Elizabeth
Margaret <=> Maggie
How can I build a query in PostgreSQL (v. 9.6) that will find possible matches like these on nicknames? My initial guess is that I would have to either find or build some sort of intermediary table that contains associated names like those above. Sometimes though, there will be more than matching pairs, like:
Jim <=> James <=> Jimmy <=> Jimmie
Bill <=> Will <=> Willie <=> William
and so forth.
Has anyone used or developed PostgreSQL queries that will find matches like these? I am running all my database queries. on my local laptops (Win7 and macOS), so performance or uptime is no issue here. I am curious to see how others in this community have creatively solved this common problem.
One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might work here, but honestly I am clueless as to how to set this up or use it in queries successfully.
Thanks,
Michael (aka Mike, aka Mikey)
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hello Michael,
On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote:
Greetings,
I have two tables that are populated using large datasets from
disparate external systems, and I am trying to match records by
customer name between these two tables. I do not have any
authoritative key, such as customerID or nationalID, by which I can
match them up, and I have found many cases where the same customer
has different first names in the two datasets. A sampling of the
differences is as follows:Michael <=> Mike
Tom <=> Thomas
Liz <=> Elizabeth
Margaret <=> MaggieHow can I build a query in PostgreSQL (v. 9.6) that will find
possible matches like these on nicknames? My initial guess is that I
would have to either find or build some sort of intermediary table
that contains associated names like those above. Sometimes though,
there will be more than matching pairs, like:Jim <=> James <=> Jimmy <=> Jimmie
Bill <=> Will <=> Willie <=> Williamand so forth.
Has anyone used or developed PostgreSQL queries that will find
matches like these? I am running all my database queries. on my local
laptops (Win7 and macOS), so performance or uptime is no issue here.
I am curious to see how others in this community have creatively
solved this common problem.One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might
work here, but honestly I am clueless as to how to set this up or use
it in queries successfully.Thanks,
Michael (aka Mike, aka Mikey)
Check out chapter F15 in the doco.
Try the double metaphone.
I worked on something similar many years ago cleaning up input created
by data entry clerks from hand written speeding tickets, so as to match
with "trusted" data held in a database.
As the volume of input was small in comparison with the number of
licensed drivers, we could iterate over and over again trying to match
it up.
HTH.
Rob
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tue, Nov 29, 2016 at 6:56 PM, rob stone <floriparob@gmail.com> wrote:
Hello Michael,
On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote:Greetings,
I have two tables that are populated using large datasets from
disparate external systems, and I am trying to match records by
customer name between these two tables. I do not have any
authoritative key, such as customerID or nationalID, by which I can
match them up, and I have found many cases where the same customer
has different first names in the two datasets. A sampling of the
differences is as follows:Michael <=> Mike
Tom <=> Thomas
Liz <=> Elizabeth
Margaret <=> MaggieHow can I build a query in PostgreSQL (v. 9.6) that will find
possible matches like these on nicknames? My initial guess is that I
would have to either find or build some sort of intermediary table
that contains associated names like those above. Sometimes though,
there will be more than matching pairs, like:Jim <=> James <=> Jimmy <=> Jimmie
Bill <=> Will <=> Willie <=> Williamand so forth.
Has anyone used or developed PostgreSQL queries that will find
matches like these? I am running all my database queries. on my local
laptops (Win7 and macOS), so performance or uptime is no issue here.
I am curious to see how others in this community have creatively
solved this common problem.One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might
work here, but honestly I am clueless as to how to set this up or use
it in queries successfully.Thanks,
Michael (aka Mike, aka Mikey)Check out chapter F15 in the doco.
Try the double metaphone.
I worked on something similar many years ago cleaning up input created
by data entry clerks from hand written speeding tickets, so as to match
with "trusted" data held in a database.
As the volume of input was small in comparison with the number of
licensed drivers, we could iterate over and over again trying to match
it up.
Also check out pg_trgm extension. It's better for addresses than
names, but might be something to look at depending on how things turn
up with the data.
merlin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general