Levenshtein Distance with more than 255 characters

Started by Janek Sendrowskiover 12 years ago8 messagesgeneral
Jump to latest
#1Janek Sendrowski
janek12@web.de

<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"><div style="font-family: Verdana;font-size: 12.0px;">
<div>Hi,</div>

<div>&nbsp;</div>

<div>I&#39;m searching for an optimized Levenshtein Distance like Postgresql&#39;s. My problem is that I want to compare strings with a length over 255 characters.</div>

<div><span style="line-height: 1.6em;">Does anyone know a solution?</span></div>

<div>&nbsp;</div>

<div>Janek Sendrowski</div>
</div></div></body></html>

#2Szymon Guz
mabewlun@gmail.com
In reply to: Janek Sendrowski (#1)
Re: Levenshtein Distance with more than 255 characters

On 6 September 2013 01:00, Janek Sendrowski <janek12@web.de> wrote:

Hi,

I'm searching for an optimized Levenshtein Distance like Postgresql's. My
problem is that I want to compare strings with a length over 255 characters.
Does anyone know a solution?

Janek Sendrowski

Hi,
I'm not sure there is anything different from what you've found in
core/contribs. But you can always use pg/plpython or pg/plperl procedure
with some external library calculating the distance.

regards
Szymon

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Szymon Guz (#2)
Re: Levenshtein Distance with more than 255 characters

Szymon Guz <mabewlun@gmail.com> writes:

On 6 September 2013 01:00, Janek Sendrowski <janek12@web.de> wrote:

I'm searching for an optimized Levenshtein Distance like Postgresql's. My
problem is that I want to compare strings with a length over 255 characters.
Does anyone know a solution?

I'm not sure there is anything different from what you've found in
core/contribs. But you can always use pg/plpython or pg/plperl procedure
with some external library calculating the distance.

Well, you could just rebuild the fuzzystrmatch module with a different
value for MAX_LEVENSHTEIN_STRLEN. The comments in the code note that the
comparison cost is roughly O(N^2) in the string length, and the reason for
having a limit at all is to ensure the function runtime doesn't get out of
hand --- but it seems likely to me that 255 is an unnecessarily
conservative limit. If you wanted to do a few tests and report back on
just how slow it can get, we might be persuaded to raise the stock
setting.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Szymon Guz
mabewlun@gmail.com
In reply to: Tom Lane (#3)
Re: Levenshtein Distance with more than 255 characters

On 6 September 2013 08:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Szymon Guz <mabewlun@gmail.com> writes:

On 6 September 2013 01:00, Janek Sendrowski <janek12@web.de> wrote:

I'm searching for an optimized Levenshtein Distance like Postgresql's.

My

problem is that I want to compare strings with a length over 255

characters.

Does anyone know a solution?

I'm not sure there is anything different from what you've found in
core/contribs. But you can always use pg/plpython or pg/plperl procedure
with some external library calculating the distance.

Well, you could just rebuild the fuzzystrmatch module with a different
value for MAX_LEVENSHTEIN_STRLEN. The comments in the code note that the
comparison cost is roughly O(N^2) in the string length, and the reason for
having a limit at all is to ensure the function runtime doesn't get out of
hand --- but it seems likely to me that 255 is an unnecessarily
conservative limit. If you wanted to do a few tests and report back on
just how slow it can get, we might be persuaded to raise the stock
setting.

regards, tom lane

I've checked that and I think we could raise the limit without any problem

I've set
#define MAX_LEVENSHTEIN_STRLEN 255 * 255

I was using the levenshtein function comparing two strings of the same
length

Strings length: 640
Difference: 531
Time: 2.5ms

Strings length: 1920
Difference: 1258
Time: 20ms

Strings length: 5760
Difference: 1811
Time: 146ms

regards,
Szymon

#5Janek Sendrowski
janek12@web.de
In reply to: Szymon Guz (#4)
Re: Levenshtein Distance with more than 255 characters

Where can I change levensthein_max_length?
Janek Sendrowski

Von: "Szymon Guz" <mabewlun@gmail.com>
An: "Tom Lane" <tgl@sss.pgh.pa.us>
Betreff: Re: [GENERAL] Levenshtein Distance with more than 255 characters
On 6 September 2013 08:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Szymon Guz <mabewlun@gmail.com> writes:

On 6 September 2013 01:00, Janek Sendrowski <janek12@web.de> wrote:

I'm searching for an optimized Levenshtein Distance like Postgresql's.

My

problem is that I want to compare strings with a length over 255

characters.

Does anyone know a solution?

I'm not sure there is anything different from what you've found in
core/contribs. But you can always use pg/plpython or pg/plperl procedure
with some external library calculating the distance.

Well, you could just rebuild the fuzzystrmatch module with a different
value for MAX_LEVENSHTEIN_STRLEN. The comments in the code note that the
comparison cost is roughly O(N^2) in the string length, and the reason for
having a limit at all is to ensure the function runtime doesn't get out of
hand --- but it seems likely to me that 255 is an unnecessarily
conservative limit. If you wanted to do a few tests and report back on
just how slow it can get, we might be persuaded to raise the stock
setting.

regards, tom lane

I've checked that and I think we could raise the limit without any problem

I've set
#define MAX_LEVENSHTEIN_STRLEN 255 * 255

I was using the levenshtein function comparing two strings of the same
length

Strings length: 640
Difference: 531
Time: 2.5ms

Strings length: 1920
Difference: 1258
Time: 20ms

Strings length: 5760
Difference: 1811
Time: 146ms

regards,
Szymon

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6John R Pierce
pierce@hogranch.com
In reply to: Janek Sendrowski (#5)
Re: Levenshtein Distance with more than 255 characters

On 9/6/2013 2:00 PM, janek12@web.de wrote:

Where can I change levensthein_max_length?

as the message you quoted said, its

#define MAX_LEVENSHTEIN_STRLEN

I'd expect this (without bothering to look) to be in a .h file in the
fuzzystrmatch contributed module directory.

--
john r pierce 37N 122W
somewhere on the middle of the left coast

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Janek Sendrowski
janek12@web.de
In reply to: John R Pierce (#6)
Re: Levenshtein Distance with more than 255 characters

<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"><div>Do you know the destination. I cant find it.</div></div></body></html>

#8Michael Paquier
michael@paquier.xyz
In reply to: Janek Sendrowski (#7)
Re: Levenshtein Distance with more than 255 characters

On Sat, Sep 7, 2013 at 6:28 AM, Janek Sendrowski <janek12@web.de> wrote:

Do you know the destination. I cant find it.

Here it is:
$ find . -name "*.[c|h]" | xgrep MAX_LEVENSHTEIN_STRLEN
./contrib/fuzzystrmatch/levenshtein.c:#define MAX_LEVENSHTEIN_STRLEN 255
./contrib/fuzzystrmatch/levenshtein.c: if (m > MAX_LEVENSHTEIN_STRLEN ||
./contrib/fuzzystrmatch/levenshtein.c: n > MAX_LEVENSHTEIN_STRLEN)
./contrib/fuzzystrmatch/levenshtein.c:
MAX_LEVENSHTEIN_STRLEN)));
--
Michael

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general