bug: fuzzystrmatch levenshtein is wrong
The current behavior of levenshtein(text,text,int,int,int) is wrong. Consider:
leki_dev=# select levenshtein('','a',2,4,5);
levenshtein
-------------
1
(1 row)
leki_dev=# select levenshtein('a','',2,4,5);
levenshtein
-------------
1
(1 row)
leki_dev=# select levenshtein('aa','a',2,4,5);
levenshtein
-------------
1
(1 row)
leki_dev=# select levenshtein('a','aa',2,4,5);
levenshtein
-------------
1
(1 row)
versus (after patch)
postgres=# select levenshtein('','a',2,4,5);
levenshtein
-------------
2
(1 row)
postgres=# select levenshtein('a','',2,4,5);
levenshtein
-------------
4
(1 row)
postgres=# select levenshtein('aa','a',2,4,5);
levenshtein
-------------
4
(1 row)
postgres=# select levenshtein('a','aa',2,4,5);
levenshtein
-------------
2
(1 row)
patch attached.
Greetings
Marcin Mańk
Attachments:
levenshtein.diffapplication/octet-stream; name=levenshtein.diffDownload
index d42515f..090b970 100644
*** a/contrib/fuzzystrmatch/fuzzystrmatch.c
--- b/contrib/fuzzystrmatch/fuzzystrmatch.c
*************** levenshtein_internal(const char *s, cons
*** 207,219 ****
n = strlen(t);
/*
! * If either m or n is 0, the answer is the other value. This makes sense
! * since it would take that many insertions to build a matching string
*/
if (!m)
! return n;
if (!n)
! return m;
/*
* For security concerns, restrict excessive CPU+RAM usage. (This
--- 207,218 ----
n = strlen(t);
/*
! * if either input is empty, the operation is all inserions or all deletions
*/
if (!m)
! return n * ins_c;
if (!n)
! return m * del_c;
/*
* For security concerns, restrict excessive CPU+RAM usage. (This
*************** levenshtein_internal(const char *s, cons
*** 241,247 ****
/* Initialize the "previous" row to 0..cols */
for (i = 0; i < m; i++)
! prev[i] = i;
/* Loop through rows of the notional array */
for (y = t, j = 1; j < n; y++, j++)
--- 240,246 ----
/* Initialize the "previous" row to 0..cols */
for (i = 0; i < m; i++)
! prev[i] = i * del_c;
/* Loop through rows of the notional array */
for (y = t, j = 1; j < n; y++, j++)
*************** levenshtein_internal(const char *s, cons
*** 252,258 ****
* First cell must increment sequentially, as we're on the j'th row of
* the (m+1)x(n+1) array.
*/
! curr[0] = j;
for (x = s, i = 1; i < m; x++, i++)
{
--- 251,257 ----
* First cell must increment sequentially, as we're on the j'th row of
* the (m+1)x(n+1) array.
*/
! curr[0] = j * ins_c;
for (x = s, i = 1; i < m; x++, i++)
{
*************** levenshtein_internal(const char *s, cons
*** 280,285 ****
--- 279,285 ----
* Because the final value was swapped from the previous row to the
* current row, that's where we'll find it.
*/
+
return prev[m - 1];
}
also there is integer overflow:
postgres=# select levenshtein('aaaaaaaaaaaaaaaa','',1,1000000000,1);
levenshtein
-------------
-1179869184
(1 row)
should we reject arguments greater than,say, 10000 ?
maximum input length is 255 currently, so the maximum numbers involved
would be about 10000*255*2
This would leave some breathing room if we wanted to increase the
maximum input string length.
Greetings
Marcin
On Mon, Dec 7, 2009 at 8:33 AM, marcin mank <marcin.mank@gmail.com> wrote:
The current behavior of levenshtein(text,text,int,int,int) is wrong. Consider:
Is this the same problem as bug #5098?
...Robert
On Tue, Dec 8, 2009 at 4:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
The current behavior of levenshtein(text,text,int,int,int) is wrong. Consider:
Is this the same problem as bug #5098?
Yes. Exact same change, plus the shortcut evaluation (for when one of
the inputs is empty) was also wrong. I fixed that too.
Greetings
Maricn Mańk
On Mon, Dec 7, 2009 at 8:33 AM, marcin mank <marcin.mank@gmail.com> wrote:
The current behavior of levenshtein(text,text,int,int,int) is wrong. Consider:
leki_dev=# select levenshtein('','a',2,4,5);
levenshtein
-------------
1
(1 row)leki_dev=# select levenshtein('a','',2,4,5);
levenshtein
-------------
1
(1 row)leki_dev=# select levenshtein('aa','a',2,4,5);
levenshtein
-------------
1
(1 row)leki_dev=# select levenshtein('a','aa',2,4,5);
levenshtein
-------------
1
(1 row)versus (after patch)
postgres=# select levenshtein('','a',2,4,5);
levenshtein
-------------
2
(1 row)postgres=# select levenshtein('a','',2,4,5);
levenshtein
-------------
4
(1 row)postgres=# select levenshtein('aa','a',2,4,5);
levenshtein
-------------
4
(1 row)postgres=# select levenshtein('a','aa',2,4,5);
levenshtein
-------------
2
(1 row)patch attached.
I cannot get this patch to apply for anything. All 4 hunks fail, both
on HEAD and on the the pre-8.4-pgindent version of fuzzystrmatch.c.
Either I'm doing something wrong here, or there's something wrong with
this patch file.
...Robert
Robert Haas wrote:
patch attached.
I cannot get this patch to apply for anything. All 4 hunks fail, both
on HEAD and on the the pre-8.4-pgindent version of fuzzystrmatch.c.
Either I'm doing something wrong here, or there's something wrong with
this patch file.
The author converted tabs to spaces --- there is not a single tab in the
diff file.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Tue, Dec 8, 2009 at 9:08 PM, Bruce Momjian <bruce@momjian.us> wrote:
Robert Haas wrote:
patch attached.
I cannot get this patch to apply for anything. All 4 hunks fail, both
on HEAD and on the the pre-8.4-pgindent version of fuzzystrmatch.c.
Either I'm doing something wrong here, or there's something wrong with
this patch file.The author converted tabs to spaces --- there is not a single tab in the
diff file.
Ah. I knew there had to be a reason.
I'm attaching my version of this patch. Barring objections, I am
going to apply this to HEAD and backpatch to 8.4, where this feature
(and the associated bug) were introduced.
...Robert
Attachments:
levenshtein.diffapplication/octet-stream; name=levenshtein.diffDownload
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index d42515f..0be46b0 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -207,13 +207,13 @@ levenshtein_internal(const char *s, const char *t,
n = strlen(t);
/*
- * If either m or n is 0, the answer is the other value. This makes sense
- * since it would take that many insertions to build a matching string
+ * We can transform an empty s into t with n insertions, or a non-empty t
+ * into an empty s with m deletions.
*/
if (!m)
- return n;
+ return n * ins_c;
if (!n)
- return m;
+ return m * del_c;
/*
* For security concerns, restrict excessive CPU+RAM usage. (This
@@ -241,7 +241,7 @@ levenshtein_internal(const char *s, const char *t,
/* Initialize the "previous" row to 0..cols */
for (i = 0; i < m; i++)
- prev[i] = i;
+ prev[i] = i * del_c;
/* Loop through rows of the notional array */
for (y = t, j = 1; j < n; y++, j++)
@@ -252,7 +252,7 @@ levenshtein_internal(const char *s, const char *t,
* First cell must increment sequentially, as we're on the j'th row of
* the (m+1)x(n+1) array.
*/
- curr[0] = j;
+ curr[0] = j * ins_c;
for (x = s, i = 1; i < m; x++, i++)
{
On Tue, Dec 8, 2009 at 9:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 8, 2009 at 9:08 PM, Bruce Momjian <bruce@momjian.us> wrote:
Robert Haas wrote:
patch attached.
I cannot get this patch to apply for anything. All 4 hunks fail, both
on HEAD and on the the pre-8.4-pgindent version of fuzzystrmatch.c.
Either I'm doing something wrong here, or there's something wrong with
this patch file.The author converted tabs to spaces --- there is not a single tab in the
diff file.Ah. I knew there had to be a reason.
I'm attaching my version of this patch. Barring objections, I am
going to apply this to HEAD and backpatch to 8.4, where this feature
(and the associated bug) were introduced.
Done. Yeah, my first commit! However, it appears that someone needs
to tell the pgsql-committers list that rhaas@postgresql.org is allowed
to post there.
...Robert
On Wed, Dec 9, 2009 at 9:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 8, 2009 at 9:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 8, 2009 at 9:08 PM, Bruce Momjian <bruce@momjian.us> wrote:
Robert Haas wrote:
patch attached.
I cannot get this patch to apply for anything. All 4 hunks fail, both
on HEAD and on the the pre-8.4-pgindent version of fuzzystrmatch.c.
Either I'm doing something wrong here, or there's something wrong with
this patch file.The author converted tabs to spaces --- there is not a single tab in the
diff file.Ah. I knew there had to be a reason.
I'm attaching my version of this patch. Barring objections, I am
going to apply this to HEAD and backpatch to 8.4, where this feature
(and the associated bug) were introduced.Done. Yeah, my first commit! However, it appears that someone needs
to tell the pgsql-committers list that rhaas@postgresql.org is allowed
to post there.
Uh-oh. Is it bad that I did this between the time Tom updated the
release notes and the time Marc stamped 8.4.2? *ducks and runs for
cover*
...Robert
On Wed, 2009-12-09 at 21:00 -0500, Robert Haas wrote:
Done. Yeah, my first commit!
:)
I think you forgot to update release notes for 8.4.2 :(
--
Devrim GÜNDÜZ, RHCE
Command Prompt - http://www.CommandPrompt.com
devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz
On Thu, Dec 10, 2009 at 3:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Done. Yeah, my first commit!
Great! Also, thanks for getting this in 8.4.2. Otherwise I would have
to compile this on Windows myself, which is no fun.
About the tabs vs spaces problem - I`ve decided that copying the patch
from a remote machine is best done by selecting it in the terminal and
pasting into a text file. Don`t do that :)
Greetings
Marcin Mańk