bug: fuzzystrmatch levenshtein is wrong

Started by marcin mankabout 16 years ago11 messages
#1marcin mank
marcin.mank@gmail.com
1 attachment(s)

The current behavior of levenshtein(text,text,int,int,int) is wrong. Consider:

leki_dev=# select levenshtein('','a',2,4,5);
levenshtein
-------------
1
(1 row)

leki_dev=# select levenshtein('a','',2,4,5);
levenshtein
-------------
1
(1 row)

leki_dev=# select levenshtein('aa','a',2,4,5);
levenshtein
-------------
1
(1 row)

leki_dev=# select levenshtein('a','aa',2,4,5);
levenshtein
-------------
1
(1 row)

versus (after patch)

postgres=# select levenshtein('','a',2,4,5);
levenshtein
-------------
2
(1 row)

postgres=# select levenshtein('a','',2,4,5);
levenshtein
-------------
4
(1 row)

postgres=# select levenshtein('aa','a',2,4,5);
levenshtein
-------------
4
(1 row)

postgres=# select levenshtein('a','aa',2,4,5);
levenshtein
-------------
2
(1 row)

patch attached.

Greetings
Marcin Mańk

Attachments:

levenshtein.diffapplication/octet-stream; name=levenshtein.diffDownload
index d42515f..090b970 100644
*** a/contrib/fuzzystrmatch/fuzzystrmatch.c
--- b/contrib/fuzzystrmatch/fuzzystrmatch.c
*************** levenshtein_internal(const char *s, cons
*** 207,219 ****
        n = strlen(t);

        /*
!        * If either m or n is 0, the answer is the other value. This makes sense
!        * since it would take that many insertions to build a matching string
         */
        if (!m)
!               return n;
        if (!n)
!               return m;

        /*
         * For security concerns, restrict excessive CPU+RAM usage. (This
--- 207,218 ----
        n = strlen(t);

        /*
!        * if either input is empty, the operation is all inserions or all deletions
         */
        if (!m)
!               return n * ins_c;
        if (!n)
!               return m * del_c;

        /*
         * For security concerns, restrict excessive CPU+RAM usage. (This
*************** levenshtein_internal(const char *s, cons
*** 241,247 ****

        /* Initialize the "previous" row to 0..cols */
        for (i = 0; i < m; i++)
!               prev[i] = i;

        /* Loop through rows of the notional array */
        for (y = t, j = 1; j < n; y++, j++)
--- 240,246 ----

        /* Initialize the "previous" row to 0..cols */
        for (i = 0; i < m; i++)
!               prev[i] = i * del_c;

        /* Loop through rows of the notional array */
        for (y = t, j = 1; j < n; y++, j++)
*************** levenshtein_internal(const char *s, cons
*** 252,258 ****
                 * First cell must increment sequentially, as we're on the j'th row of
                 * the (m+1)x(n+1) array.
                 */
!               curr[0] = j;

                for (x = s, i = 1; i < m; x++, i++)
                {
--- 251,257 ----
                 * First cell must increment sequentially, as we're on the j'th row of
                 * the (m+1)x(n+1) array.
                 */
!               curr[0] = j * ins_c;

                for (x = s, i = 1; i < m; x++, i++)
                {
*************** levenshtein_internal(const char *s, cons
*** 280,285 ****
--- 279,285 ----
         * Because the final value was swapped from the previous row to the
         * current row, that's where we'll find it.
         */
+
        return prev[m - 1];
  }

#2marcin mank
marcin.mank@gmail.com
In reply to: marcin mank (#1)
Re: bug: fuzzystrmatch levenshtein is wrong

also there is integer overflow:
postgres=# select levenshtein('aaaaaaaaaaaaaaaa','',1,1000000000,1);
levenshtein
-------------
-1179869184
(1 row)

should we reject arguments greater than,say, 10000 ?
maximum input length is 255 currently, so the maximum numbers involved
would be about 10000*255*2

This would leave some breathing room if we wanted to increase the
maximum input string length.

Greetings
Marcin

#3Robert Haas
robertmhaas@gmail.com
In reply to: marcin mank (#1)
Re: bug: fuzzystrmatch levenshtein is wrong

On Mon, Dec 7, 2009 at 8:33 AM, marcin mank <marcin.mank@gmail.com> wrote:

The current behavior of levenshtein(text,text,int,int,int) is wrong. Consider:

Is this the same problem as bug #5098?

...Robert

#4marcin mank
marcin.mank@gmail.com
In reply to: Robert Haas (#3)
Re: bug: fuzzystrmatch levenshtein is wrong

On Tue, Dec 8, 2009 at 4:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The current behavior of levenshtein(text,text,int,int,int) is wrong. Consider:

Is this the same problem as bug #5098?

Yes. Exact same change, plus the shortcut evaluation (for when one of
the inputs is empty) was also wrong. I fixed that too.

Greetings
Maricn Mańk

#5Robert Haas
robertmhaas@gmail.com
In reply to: marcin mank (#1)
Re: bug: fuzzystrmatch levenshtein is wrong

On Mon, Dec 7, 2009 at 8:33 AM, marcin mank <marcin.mank@gmail.com> wrote:

The current behavior of levenshtein(text,text,int,int,int) is wrong. Consider:

leki_dev=# select levenshtein('','a',2,4,5);
 levenshtein
-------------
          1
(1 row)

leki_dev=# select levenshtein('a','',2,4,5);
 levenshtein
-------------
          1
(1 row)

leki_dev=# select levenshtein('aa','a',2,4,5);
 levenshtein
-------------
          1
(1 row)

leki_dev=# select levenshtein('a','aa',2,4,5);
 levenshtein
-------------
          1
(1 row)

versus (after patch)

postgres=# select levenshtein('','a',2,4,5);
 levenshtein
-------------
          2
(1 row)

postgres=# select levenshtein('a','',2,4,5);
 levenshtein
-------------
          4
(1 row)

postgres=# select levenshtein('aa','a',2,4,5);
 levenshtein
-------------
          4
(1 row)

postgres=# select levenshtein('a','aa',2,4,5);
 levenshtein
-------------
          2
(1 row)

patch attached.

I cannot get this patch to apply for anything. All 4 hunks fail, both
on HEAD and on the the pre-8.4-pgindent version of fuzzystrmatch.c.
Either I'm doing something wrong here, or there's something wrong with
this patch file.

...Robert

#6Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#5)
Re: bug: fuzzystrmatch levenshtein is wrong

Robert Haas wrote:

patch attached.

I cannot get this patch to apply for anything. All 4 hunks fail, both
on HEAD and on the the pre-8.4-pgindent version of fuzzystrmatch.c.
Either I'm doing something wrong here, or there's something wrong with
this patch file.

The author converted tabs to spaces --- there is not a single tab in the
diff file.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#7Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#6)
1 attachment(s)
Re: bug: fuzzystrmatch levenshtein is wrong

On Tue, Dec 8, 2009 at 9:08 PM, Bruce Momjian <bruce@momjian.us> wrote:

Robert Haas wrote:

patch attached.

I cannot get this patch to apply for anything.  All 4 hunks fail, both
on HEAD and on the the pre-8.4-pgindent version of fuzzystrmatch.c.
Either I'm doing something wrong here, or there's something wrong with
this patch file.

The author converted tabs to spaces --- there is not a single tab in the
diff file.

Ah. I knew there had to be a reason.

I'm attaching my version of this patch. Barring objections, I am
going to apply this to HEAD and backpatch to 8.4, where this feature
(and the associated bug) were introduced.

...Robert

Attachments:

levenshtein.diffapplication/octet-stream; name=levenshtein.diffDownload
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index d42515f..0be46b0 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -207,13 +207,13 @@ levenshtein_internal(const char *s, const char *t,
 	n = strlen(t);
 
 	/*
-	 * If either m or n is 0, the answer is the other value. This makes sense
-	 * since it would take that many insertions to build a matching string
+	 * We can transform an empty s into t with n insertions, or a non-empty t
+	 * into an empty s with m deletions.
 	 */
 	if (!m)
-		return n;
+		return n * ins_c;
 	if (!n)
-		return m;
+		return m * del_c;
 
 	/*
 	 * For security concerns, restrict excessive CPU+RAM usage. (This
@@ -241,7 +241,7 @@ levenshtein_internal(const char *s, const char *t,
 
 	/* Initialize the "previous" row to 0..cols */
 	for (i = 0; i < m; i++)
-		prev[i] = i;
+		prev[i] = i * del_c;
 
 	/* Loop through rows of the notional array */
 	for (y = t, j = 1; j < n; y++, j++)
@@ -252,7 +252,7 @@ levenshtein_internal(const char *s, const char *t,
 		 * First cell must increment sequentially, as we're on the j'th row of
 		 * the (m+1)x(n+1) array.
 		 */
-		curr[0] = j;
+		curr[0] = j * ins_c;
 
 		for (x = s, i = 1; i < m; x++, i++)
 		{
#8Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#7)
Re: bug: fuzzystrmatch levenshtein is wrong

On Tue, Dec 8, 2009 at 9:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 8, 2009 at 9:08 PM, Bruce Momjian <bruce@momjian.us> wrote:

Robert Haas wrote:

patch attached.

I cannot get this patch to apply for anything.  All 4 hunks fail, both
on HEAD and on the the pre-8.4-pgindent version of fuzzystrmatch.c.
Either I'm doing something wrong here, or there's something wrong with
this patch file.

The author converted tabs to spaces --- there is not a single tab in the
diff file.

Ah.  I knew there had to be a reason.

I'm attaching my version of this patch.  Barring objections, I am
going to apply this to HEAD and backpatch to 8.4, where this feature
(and the associated bug) were introduced.

Done. Yeah, my first commit! However, it appears that someone needs
to tell the pgsql-committers list that rhaas@postgresql.org is allowed
to post there.

...Robert

#9Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#8)
Re: bug: fuzzystrmatch levenshtein is wrong

On Wed, Dec 9, 2009 at 9:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 8, 2009 at 9:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Dec 8, 2009 at 9:08 PM, Bruce Momjian <bruce@momjian.us> wrote:

Robert Haas wrote:

patch attached.

I cannot get this patch to apply for anything.  All 4 hunks fail, both
on HEAD and on the the pre-8.4-pgindent version of fuzzystrmatch.c.
Either I'm doing something wrong here, or there's something wrong with
this patch file.

The author converted tabs to spaces --- there is not a single tab in the
diff file.

Ah.  I knew there had to be a reason.

I'm attaching my version of this patch.  Barring objections, I am
going to apply this to HEAD and backpatch to 8.4, where this feature
(and the associated bug) were introduced.

Done.  Yeah, my first commit!  However, it appears that someone needs
to tell the pgsql-committers list that rhaas@postgresql.org is allowed
to post there.

Uh-oh. Is it bad that I did this between the time Tom updated the
release notes and the time Marc stamped 8.4.2? *ducks and runs for
cover*

...Robert

#10Devrim GÜNDÜZ
devrim@gunduz.org
In reply to: Robert Haas (#8)
Re: bug: fuzzystrmatch levenshtein is wrong

On Wed, 2009-12-09 at 21:00 -0500, Robert Haas wrote:

Done. Yeah, my first commit!

:)

I think you forgot to update release notes for 8.4.2 :(

--
Devrim GÜNDÜZ, RHCE
Command Prompt - http://www.CommandPrompt.com
devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz

#11marcin mank
marcin.mank@gmail.com
In reply to: Robert Haas (#8)
Re: bug: fuzzystrmatch levenshtein is wrong

On Thu, Dec 10, 2009 at 3:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Done.  Yeah, my first commit!

Great! Also, thanks for getting this in 8.4.2. Otherwise I would have
to compile this on Windows myself, which is no fun.

About the tabs vs spaces problem - I`ve decided that copying the patch
from a remote machine is best done by selecting it in the terminal and
pasting into a text file. Don`t do that :)

Greetings
Marcin Mańk