General guidance if there is an in dadabase solution or should stay as excel vba solution.

Started by Henry Drexlerover 14 years ago5 messagesgeneral
Jump to latest
#1Henry Drexler
alonup8tb@gmail.com

I have no problem doing this in excel vba, though as the list grows larger
obviously excel has row limits.

What is being done:
There is a column of data imported into the db - they are just text strings,
there are about 80,000 rows of them. The goal is to do a single character
elimination to find matches.

so for instance the data is a bunch of rows of this:

hello there
what is your name
happy birthday
we are winner
we are winners
we like the sky
task to do
tasks to do

so for the above in excel I created a macro that will remove one character
and compare and do this for each character of each text string.

The final product:

hello there
what is your name
happy birthday
we are winner we are winners
we are winners we are winner
we like the sky
task to do tasks to do
tasks to do task to do

so you can see that it found the matches with being one character off.

Is this something best done outside of the db and in excel as I am doing or
is it possible to do it in db?

Note I am not looking for someone to give a whole solution - just if they
know it can be done let me know the direction so I can research it and
figure it out.

Any advice is welcome.

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Henry Drexler (#1)
Re: General guidance if there is an in dadabase solution or should stay as excel vba solution.

Look at this module for the actual comparison algorithms (found in Appendix
F)

"fuzzystrmatch"

Performance would be my only concern but you have that issue either way.
With "plpgsql" you can do most things in the database you could do in VBA.
Whether you want to bog the DB down with a processor intensive process like
this is another question to consider.

I am hoping you are putting in limits such as requiring that the first
character (or even first partial word) are equal before even checking for an
off-by-one error. With the "Levenshtein" algorithm you'd be looking for a
value of "1" to match your current behavior.

In short, what you are doing (given your specification below) in VBA is also
doable in PostgreSQL.

David J.

From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Henry Drexler
Sent: Monday, September 19, 2011 9:10 AM
To: pgsql-general
Subject: [GENERAL] General guidance if there is an in dadabase solution or
should stay as excel vba solution.

I have no problem doing this in excel vba, though as the list grows larger
obviously excel has row limits.

What is being done:

There is a column of data imported into the db - they are just
text strings, there are about 80,000 rows of them. The goal is to do a
single character elimination to find matches.

so for instance the data is a bunch of rows of this:

hello there

what is your name

happy birthday

we are winner

we are winners

we like the sky

task to do

tasks to do

so for the above in excel I created a macro that will remove one character
and compare and do this for each character of each text string.

The final product:

hello there

what is your name

happy birthday

we are winner we are winners

we are winners we are winner

we like the sky

task to do tasks to do

tasks to do task to do

so you can see that it found the matches with being one character off.

Is this something best done outside of the db and in excel as I am doing or
is it possible to do it in db?

Note I am not looking for someone to give a whole solution - just if they
know it can be done let me know the direction so I can research it and
figure it out.

Any advice is welcome.

#3Henry Drexler
alonup8tb@gmail.com
In reply to: David G. Johnston (#2)
Re: General guidance if there is an in dadabase solution or should stay as excel vba solution.

Thanks you that is the kind of suggestion I was looking for - I will look
into plpgsql.

Yes, there are several optimizations in it - though due to the actual data
the first few characters cannot be tested. Some of the actual optimizations
are only to reach out to the surrounding 100 rows and to skip numbers in the
characters.

On Mon, Sep 19, 2011 at 10:17 AM, David Johnston <polobo@yahoo.com> wrote:

Show quoted text

Look at this module for the actual comparison algorithms (found in Appendix
F)****

** **

“fuzzystrmatch”****

** **

Performance would be my only concern but you have that issue either way.
With “plpgsql” you can do most things in the database you could do in VBA.
Whether you want to bog the DB down with a processor intensive process like
this is another question to consider.****

** **

I am hoping you are putting in limits such as requiring that the first
character (or even first partial word) are equal before even checking for an
off-by-one error. With the “Levenshtein” algorithm you’d be looking for a
value of “1” to match your current behavior.****

** **

In short, what you are doing (given your specification below) in VBA is
also doable in PostgreSQL.****

** **

David J.****

** **

** **

*From:* pgsql-general-owner@postgresql.org [mailto:
pgsql-general-owner@postgresql.org] *On Behalf Of *Henry Drexler
*Sent:* Monday, September 19, 2011 9:10 AM
*To:* pgsql-general
*Subject:* [GENERAL] General guidance if there is an in dadabase solution
or should stay as excel vba solution.****

** **

I have no problem doing this in excel vba, though as the list grows larger
obviously excel has row limits.****

** **

** **

What is being done:****

There is a column of data imported into the db - they are just
text strings, there are about 80,000 rows of them. The goal is to do a
single character elimination to find matches.****

** **

so for instance the data is a bunch of rows of this:****

** **

hello there****

what is your name****

happy birthday****

we are winner****

we are winners****

we like the sky****

task to do****

tasks to do****

** **

so for the above in excel I created a macro that will remove one character
and compare and do this for each character of each text string.****

** **

The final product:****

** **

hello there****

what is your name****

happy birthday****

we are winner we are winners****

we are winners we are winner****

we like the sky****

task to do tasks to do****

tasks to do task to do****

** **

** **

so you can see that it found the matches with being one character off.****

** **

** **

Is this something best done outside of the db and in excel as I am doing or
is it possible to do it in db?****

** **

Note I am not looking for someone to give a whole solution - just if they
know it can be done let me know the direction so I can research it and
figure it out.****

** **

Any advice is welcome.****

** **

#4David G. Johnston
david.g.johnston@gmail.com
In reply to: Henry Drexler (#3)
Re: General guidance if there is an in dadabase solution or should stay as excel vba solution.

You can probably do this without plpgsql through liberal use of CTEs (WITH) and sub-queries.

Also look at arrayed types for "saving" matches and filtering out already tested pairs.

David J.

On Sep 19, 2011, at 10:37, Henry Drexler <alonup8tb@gmail.com> wrote:

Show quoted text

Thanks you that is the kind of suggestion I was looking for - I will look into plpgsql.

Yes, there are several optimizations in it - though due to the actual data the first few characters cannot be tested. Some of the actual optimizations are only to reach out to the surrounding 100 rows and to skip numbers in the characters.

On Mon, Sep 19, 2011 at 10:17 AM, David Johnston <polobo@yahoo.com> wrote:
Look at this module for the actual comparison algorithms (found in Appendix F)

“fuzzystrmatch”

Performance would be my only concern but you have that issue either way. With “plpgsql” you can do most things in the database you could do in VBA. Whether you want to bog the DB down with a processor intensive process like this is another question to consider.

I am hoping you are putting in limits such as requiring that the first character (or even first partial word) are equal before even checking for an off-by-one error. With the “Levenshtein” algorithm you’d be looking for a value of “1” to match your current behavior.

In short, what you are doing (given your specification below) in VBA is also doable in PostgreSQL.

David J.

From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Henry Drexler
Sent: Monday, September 19, 2011 9:10 AM
To: pgsql-general
Subject: [GENERAL] General guidance if there is an in dadabase solution or should stay as excel vba solution.

I have no problem doing this in excel vba, though as the list grows larger obviously excel has row limits.

What is being done:

There is a column of data imported into the db - they are just text strings, there are about 80,000 rows of them. The goal is to do a single character elimination to find matches.

so for instance the data is a bunch of rows of this:

hello there

what is your name

happy birthday

we are winner

we are winners

we like the sky

task to do

tasks to do

so for the above in excel I created a macro that will remove one character and compare and do this for each character of each text string.

The final product:

hello there

what is your name

happy birthday

we are winner we are winners

we are winners we are winner

we like the sky

task to do tasks to do

tasks to do task to do

so you can see that it found the matches with being one character off.

Is this something best done outside of the db and in excel as I am doing or is it possible to do it in db?

Note I am not looking for someone to give a whole solution - just if they know it can be done let me know the direction so I can research it and figure it out.

Any advice is welcome.

#5Henry Drexler
alonup8tb@gmail.com
In reply to: David G. Johnston (#4)
Re: General guidance if there is an in dadabase solution or should stay as excel vba solution.

excellent - thank you again.

On Mon, Sep 19, 2011 at 11:05 AM, David Johnston <polobo@yahoo.com> wrote:

Show quoted text

You can probably do this without plpgsql through liberal use of CTEs (WITH)
and sub-queries.

Also look at arrayed types for "saving" matches and filtering out already
tested pairs.

David J.

On Sep 19, 2011, at 10:37, Henry Drexler <alonup8tb@gmail.com> wrote:

Thanks you that is the kind of suggestion I was looking for - I will look
into plpgsql.

Yes, there are several optimizations in it - though due to the actual data
the first few characters cannot be tested. Some of the actual optimizations
are only to reach out to the surrounding 100 rows and to skip numbers in the
characters.

On Mon, Sep 19, 2011 at 10:17 AM, David Johnston < <polobo@yahoo.com>
polobo@yahoo.com> wrote:

Look at this module for the actual comparison algorithms (found in
Appendix F)****

** **

“fuzzystrmatch”****

** **

Performance would be my only concern but you have that issue either way.
With “plpgsql” you can do most things in the database you could do in VBA.
Whether you want to bog the DB down with a processor intensive process like
this is another question to consider.****

** **

I am hoping you are putting in limits such as requiring that the first
character (or even first partial word) are equal before even checking for an
off-by-one error. With the “Levenshtein” algorithm you’d be looking for a
value of “1” to match your current behavior.****

** **

In short, what you are doing (given your specification below) in VBA is
also doable in PostgreSQL.****

** **

David J.****

** **

** **

*From:* <pgsql-general-owner@postgresql.org>
pgsql-general-owner@postgresql.org [mailto:<pgsql-general-owner@postgresql.org>
pgsql-general-owner@postgresql.org] *On Behalf Of *Henry Drexler
*Sent:* Monday, September 19, 2011 9:10 AM
*To:* pgsql-general
*Subject:* [GENERAL] General guidance if there is an in dadabase solution
or should stay as excel vba solution.****

** **

I have no problem doing this in excel vba, though as the list grows larger
obviously excel has row limits.****

** **

** **

What is being done:****

There is a column of data imported into the db - they are
just text strings, there are about 80,000 rows of them. The goal is to do a
single character elimination to find matches.****

** **

so for instance the data is a bunch of rows of this:****

** **

hello there****

what is your name****

happy birthday****

we are winner****

we are winners****

we like the sky****

task to do****

tasks to do****

** **

so for the above in excel I created a macro that will remove one character
and compare and do this for each character of each text string.****

** **

The final product:****

** **

hello there****

what is your name****

happy birthday****

we are winner we are winners****

we are winners we are winner****

we like the sky****

task to do tasks to do****

tasks to do task to do****

** **

** **

so you can see that it found the matches with being one character off.***
*

** **

** **

Is this something best done outside of the db and in excel as I am doing
or is it possible to do it in db?****

** **

Note I am not looking for someone to give a whole solution - just if they
know it can be done let me know the direction so I can research it and
figure it out.****

** **

Any advice is welcome.****

** **