Bug in metaphone (contrib/fuzzystrmatch)

Started by Jim Nasby almost 23 years ago, 6 messages, general
#1 Jim Nasby
Jim.Nasby@BlueTreble.com

The second argument to metaphone is supposed to limit the number of
characters returned, but it breaks on some phrases:

usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from (select 'Hello world'::varchar AS a) a;
HLW | HLWR | HLWRLT

usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
AKM | AKMKS | AKMKSMMRL

In every case I've found that does this, the 4th and 5th letters are
always 'KS'.
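
One way this kind of overshoot can arise, sketched as a toy and not taken from the real metaphone rules: if the length limit is only tested once per input letter, a letter that emits two codes (such as 'X' producing "KS") can push the output one past the limit. The names toy_encode and max_codes are hypothetical, and the assumption that the limit is checked per letter is mine, not something stated in the report.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy encoder (hypothetical, NOT the real metaphone rules): every letter
 * maps to itself, except 'X', which emits the two codes 'K' and 'S'.
 * max_codes == 0 means "no limit". The caller must supply a buffer with
 * room for two codes per input letter plus the terminating NUL.
 */
static void toy_encode(const char *word, size_t max_codes, char *out)
{
    size_t len = 0;
    const char *p;

    assert(word != NULL && out != NULL);

    /* The limit is tested once per input letter ... */
    for (p = word; *p != '\0' && (max_codes == 0 || len < max_codes); p++)
    {
        if (*p == 'X')
        {
            out[len++] = 'K';
            out[len++] = 'S';   /* ... so this second code can exceed it */
        }
        else
            out[len++] = *p;
    }
    out[len] = '\0';
}
```

Calling toy_encode("ABCX", 4, buf) leaves "ABCKS" in buf: five codes although four were requested, the same shape as the AKMKS result above.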
--
Jim C. Nasby (aka Decibel!) jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#2 Joe Conway
mail@joeconway.com
In reply to: Jim Nasby (#1)
Re: Bug in metaphone (contrib/fuzzystrmatch)

Jim C. Nasby wrote:

The second argument to metaphone is supposed to limit the number of
characters returned, but it breaks on some phrases:

usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from
(select 'Hello world'::varchar AS a) a;
HLW | HLWR | HLWRLT

usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from
(select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
AKM | AKMKS | AKMKSMMRL

In every case I've found that does this, the 4th and 5th letters are
always 'KS'.

Nice catch.

There was a bug in the original metaphone algorithm from CPAN. Patch
attached (while I was at it I updated my email address, changed the
copyright to PGDG, and removed an unnecessary palloc). Here's how it
looks now:

regression=# select metaphone(a,4) from (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
metaphone
-----------
AKMK
(1 row)

regression=# select metaphone(a,5) from (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
metaphone
-----------
AKMKS
(1 row)

Please apply.

Thanks,

Joe

Attachments:

fuzzystrmatch-fix.patch (text/plain, +18 -21)

#3 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Joe Conway (#2)
Re: Bug in metaphone (contrib/fuzzystrmatch)

Great, I'll try it right away. I was also wondering why you have the
function bomb if it's fed an empty string or a NULL? It seems it would
be much nicer to have it return an empty string or NULL, respectively.
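
A minimal sketch of the behaviour suggested here, assuming a plain C wrapper outside PostgreSQL: NULL in gives NULL out, an empty string in gives an empty string out, and anything else goes to the encoder. Both encode_lenient and stub_encode are hypothetical names; stub_encode merely upper-cases its input and stands in for the real encoder.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the real encoder: copies the input upper-cased,
 * truncating if the buffer is too small. */
static void stub_encode(const char *in, char *out, size_t outsize)
{
    size_t i = 0;

    assert(outsize > 0);
    for (; in[i] != '\0' && i + 1 < outsize; i++)
        out[i] = (char) toupper((unsigned char) in[i]);
    out[i] = '\0';
}

/* NULL in -> NULL out, "" in -> "" out, anything else goes to the encoder. */
static const char *encode_lenient(const char *in, char *out, size_t outsize)
{
    if (in == NULL)
        return NULL;            /* pass NULL through instead of failing */

    assert(out != NULL && outsize > 0);

    if (in[0] == '\0')
    {
        out[0] = '\0';          /* empty input yields empty output */
        return out;
    }

    stub_encode(in, out, outsize);
    return out;
}
```

The point of the design is only that the two edge cases are decided before the encoder runs, so the encoder itself never sees them.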

#4 Joe Conway
mail@joeconway.com
In reply to: Jim Nasby (#1)
Re: [GENERAL] Bug in metaphone (contrib/fuzzystrmatch)

(I never saw this make it to the list yesterday, so I'm resending to
patches)

#5 Bruce Momjian
bruce@momjian.us
In reply to: Joe Conway (#4)
Re: [GENERAL] Bug in metaphone (contrib/fuzzystrmatch)

Joe Conway wrote:

Index: contrib/fuzzystrmatch/README.fuzzystrmatch
===================================================================
RCS file: /opt/src/cvs/pgsql-server/contrib/fuzzystrmatch/README.fuzzystrmatch,v
retrieving revision 1.2
diff -c -r1.2 README.fuzzystrmatch
*** contrib/fuzzystrmatch/README.fuzzystrmatch	7 Aug 2001 18:16:01 -0000	1.2
--- contrib/fuzzystrmatch/README.fuzzystrmatch	6 Jun 2003 16:37:54 -0000
***************
*** 3,9 ****
*
* Functions for "fuzzy" comparison of strings
*
!  * Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
*
* levenshtein()
* -------------
--- 3,12 ----
*
* Functions for "fuzzy" comparison of strings
*
!  * Joe Conway <mail@joeconway.com>
!  *
!  * Copyright (c) 2001, 2002, 2003 by PostgreSQL Global Development Group
!  * ALL RIGHTS RESERVED;
*
* levenshtein()
* -------------
Index: contrib/fuzzystrmatch/fuzzystrmatch.c
===================================================================
RCS file: /opt/src/cvs/pgsql-server/contrib/fuzzystrmatch/fuzzystrmatch.c,v
retrieving revision 1.7
diff -c -r1.7 fuzzystrmatch.c
*** contrib/fuzzystrmatch/fuzzystrmatch.c	10 Mar 2003 22:28:17 -0000	1.7
--- contrib/fuzzystrmatch/fuzzystrmatch.c	6 Jun 2003 16:38:06 -0000
***************
*** 3,9 ****
*
* Functions for "fuzzy" comparison of strings
*
!  * Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
*
* levenshtein()
* -------------
--- 3,12 ----
*
* Functions for "fuzzy" comparison of strings
*
!  * Joe Conway <mail@joeconway.com>
!  *
!  * Copyright (c) 2001, 2002, 2003 by PostgreSQL Global Development Group
!  * ALL RIGHTS RESERVED;
*
* levenshtein()
* -------------
***************
*** 221,229 ****
  	if (!(reqlen > 0))
  		elog(ERROR, "metaphone: Requested Metaphone output length must be > 0");
  
- 	metaph = palloc(reqlen);
- 	memset(metaph, '\0', reqlen);
- 
  	retval = _metaphone(str_i, reqlen, &metaph);
  	if (retval == META_SUCCESS)
  	{
--- 224,229 ----
***************
*** 629,635 ****
  				/* KS */
  			case 'X':
  				Phonize('K');
! 				Phonize('S');
  				break;
  				/* Y if followed by a vowel */
  			case 'Y':
--- 629,636 ----
  				/* KS */
  			case 'X':
  				Phonize('K');
! 				if (max_phonemes == 0 || Phone_Len < max_phonemes)
! 					Phonize('S');
  				break;
  				/* Y if followed by a vowel */
  			case 'Y':
Index: contrib/fuzzystrmatch/fuzzystrmatch.h
===================================================================
RCS file: /opt/src/cvs/pgsql-server/contrib/fuzzystrmatch/fuzzystrmatch.h,v
retrieving revision 1.6
diff -c -r1.6 fuzzystrmatch.h
*** contrib/fuzzystrmatch/fuzzystrmatch.h	5 Sep 2002 00:43:06 -0000	1.6
--- contrib/fuzzystrmatch/fuzzystrmatch.h	6 Jun 2003 16:38:13 -0000
***************
*** 3,9 ****
*
* Functions for "fuzzy" comparison of strings
*
!  * Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
*
* levenshtein()
* -------------
--- 3,12 ----
*
* Functions for "fuzzy" comparison of strings
*
!  * Joe Conway <mail@joeconway.com>
!  *
!  * Copyright (c) 2001, 2002, 2003 by PostgreSQL Global Development Group
!  * ALL RIGHTS RESERVED;
*
* levenshtein()
* -------------
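
The guard added in the 'X' case can be tried outside PostgreSQL with a toy encoder. This is only a sketch under stated assumptions: toy_encode_capped and max_codes are hypothetical names, every letter other than 'X' simply maps to itself, and the per-letter limit test in the loop is assumed, not quoted from fuzzystrmatch.c. Only the condition before the second code mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy encoder (hypothetical, NOT the real metaphone rules): 'X' emits
 * 'K' then 'S', every other letter maps to itself. max_codes == 0 means
 * "no limit". The caller supplies a buffer large enough for the result.
 */
static void toy_encode_capped(const char *word, size_t max_codes, char *out)
{
    size_t len = 0;
    const char *p;

    assert(word != NULL && out != NULL);

    for (p = word; *p != '\0' && (max_codes == 0 || len < max_codes); p++)
    {
        if (*p == 'X')
        {
            out[len++] = 'K';
            /* Same shape as the patched condition: re-test the limit
             * before emitting the second code of the pair. */
            if (max_codes == 0 || len < max_codes)
                out[len++] = 'S';
        }
        else
            out[len++] = *p;
    }
    out[len] = '\0';
}
```

With the guard in place, toy_encode_capped("ABCX", 4, buf) stops at "ABCK", while a limit of 5 (or 0 for no limit) yields "ABCKS", matching the pattern of the AKMK and AKMKS results quoted earlier in the thread.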

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Attachments:

/wrk/email/patches (text/plain)

#6 Bruce Momjian
bruce@momjian.us
In reply to: Joe Conway (#4)
Re: [GENERAL] Bug in metaphone (contrib/fuzzystrmatch)

Patch applied. Thanks.
