Bug in metaphone (contrib/fuzzystrmatch)
The second argument to metaphone is supposed to set the limit on the number
of characters to return, but it breaks on some phrases:
usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from (select 'Hello world'::varchar AS a) a;
HLW | HLWR | HLWRLT
usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
AKM | AKMKS | AKMKSMMRL
In every case I've found that does this, the 4th and 5th letters are
always 'KS'.
--
Jim C. Nasby (aka Decibel!) jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Jim C. Nasby wrote:
> The second argument to metaphone is supposed to set the limit on the
> number of characters to return, but it breaks on some phrases:
> usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from (select 'Hello world'::varchar AS a) a;
> HLW | HLWR | HLWRLT
> usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
> AKM | AKMKS | AKMKSMMRL
> In every case I've found that does this, the 4th and 5th letters are
> always 'KS'.
Nice catch.
There was a bug in the original metaphone algorithm from CPAN. Patch
attached (while I was at it I updated my email address, changed the
copyright to PGDG, and removed an unnecessary palloc). Here's how it
looks now:
regression=# select metaphone(a,4) from (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
metaphone
-----------
AKMK
(1 row)
regression=# select metaphone(a,5) from (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
metaphone
-----------
AKMKS
(1 row)
Please apply.
Thanks,
Joe
Attachments:
fuzzystrmatch-fix.patch (text/plain, +18 -21)
Great, I'll try it right away. I was also wondering why you have the
function bomb if it's fed an empty string or a NULL? It seems it would
be much nicer to have it return an empty string/NULL, respectively.
(I never saw this make it to the list yesterday, so I'm resending to
patches)
Joe Conway wrote:
Index: contrib/fuzzystrmatch/README.fuzzystrmatch
===================================================================
RCS file: /opt/src/cvs/pgsql-server/contrib/fuzzystrmatch/README.fuzzystrmatch,v
retrieving revision 1.2
diff -c -r1.2 README.fuzzystrmatch
*** contrib/fuzzystrmatch/README.fuzzystrmatch  7 Aug 2001 18:16:01 -0000  1.2
--- contrib/fuzzystrmatch/README.fuzzystrmatch  6 Jun 2003 16:37:54 -0000
***************
*** 3,9 ****
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
   *
   * levenshtein()
   * -------------
--- 3,12 ----
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Joe Conway <mail@joeconway.com>
!  *
!  * Copyright (c) 2001, 2002, 2003 by PostgreSQL Global Development Group
!  * ALL RIGHTS RESERVED;
   *
   * levenshtein()
   * -------------
Index: contrib/fuzzystrmatch/fuzzystrmatch.c
===================================================================
RCS file: /opt/src/cvs/pgsql-server/contrib/fuzzystrmatch/fuzzystrmatch.c,v
retrieving revision 1.7
diff -c -r1.7 fuzzystrmatch.c
*** contrib/fuzzystrmatch/fuzzystrmatch.c  10 Mar 2003 22:28:17 -0000  1.7
--- contrib/fuzzystrmatch/fuzzystrmatch.c  6 Jun 2003 16:38:06 -0000
***************
*** 3,9 ****
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
   *
   * levenshtein()
   * -------------
--- 3,12 ----
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Joe Conway <mail@joeconway.com>
!  *
!  * Copyright (c) 2001, 2002, 2003 by PostgreSQL Global Development Group
!  * ALL RIGHTS RESERVED;
   *
   * levenshtein()
   * -------------
***************
*** 221,229 ****
      if (!(reqlen > 0))
          elog(ERROR, "metaphone: Requested Metaphone output length must be > 0");
  
-     metaph = palloc(reqlen);
-     memset(metaph, '\0', reqlen);
- 
      retval = _metaphone(str_i, reqlen, &metaph);
      if (retval == META_SUCCESS)
      {
--- 224,229 ----
***************
*** 629,635 ****
                  /* KS */
              case 'X':
                  Phonize('K');
!                 Phonize('S');
                  break;
                  /* Y if followed by a vowel */
              case 'Y':
--- 629,636 ----
                  /* KS */
              case 'X':
                  Phonize('K');
!                 if (max_phonemes == 0 || Phone_Len < max_phonemes)
!                     Phonize('S');
                  break;
                  /* Y if followed by a vowel */
              case 'Y':
Index: contrib/fuzzystrmatch/fuzzystrmatch.h
===================================================================
RCS file: /opt/src/cvs/pgsql-server/contrib/fuzzystrmatch/fuzzystrmatch.h,v
retrieving revision 1.6
diff -c -r1.6 fuzzystrmatch.h
*** contrib/fuzzystrmatch/fuzzystrmatch.h  5 Sep 2002 00:43:06 -0000  1.6
--- contrib/fuzzystrmatch/fuzzystrmatch.h  6 Jun 2003 16:38:13 -0000
***************
*** 3,9 ****
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
   *
   * levenshtein()
   * -------------
--- 3,12 ----
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Joe Conway <mail@joeconway.com>
!  *
!  * Copyright (c) 2001, 2002, 2003 by PostgreSQL Global Development Group
!  * ALL RIGHTS RESERVED;
   *
   * levenshtein()
   * -------------
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Patch applied. Thanks.