Using database to find file doublettes in my computer

Started by Lothar Behrens · over 17 years ago · 7 messages · general
#1 Lothar Behrens
lothar.behrens@lollisoft.de

Hi,

I have a problem: I need to find, as fast as possible, files that are
duplicates, in other words identical, and also to identify those files
that are not identical.

My approach was to use dir /s and an awk script to convert the output
into an SQL script to be imported into a table.
That done, I could start issuing queries.

But how do I query for files to display a 'left / right view' for each
file that exists in multiple places?

I mean this:

This File;Also here
C:\some.txt;C:\backup\some.txt
C:\some.txt;C:\backup1\some.txt
C:\some.txt;C:\backup2\some.txt

but have only this list:

C:\some.txt
C:\backup\some.txt
C:\backup1\some.txt
C:\backup2\some.txt

The reason is that I am faced with ECAD projects that have been copied
around many times, and I have to identify which files are missing here
and which files are there.

So a manual approach is as follows:

1) Identify one file (schematic1.sch) and see where copies of it are.
2) Compare the files of both directories and make a decision about
   what files to use further.
3) Determine conflicts, i.e. files that can't be copied together for
   a cleanup.

Are there any approaches that could help?

This is a very time-consuming job and I am searching for any solution
that helps me save time :-)

I know that these problems would not arise if the projects were well
structured and kept in a version management system. But that isn't the
case here :-)

Thanks

Lothar

#2 Eus
eus@member.fsf.org
In reply to: Lothar Behrens (#1)
Re: Using database to find file doublettes in my computer

Hi Ho!

--- On Tue, 11/18/08, Lothar Behrens <lothar.behrens@lollisoft.de> wrote:

> I have a problem to find as fast as possible files that are double or,
> in other words, identical. Also identifying those files that are not
> identical.
>
> My approach was to use dir /s and an awk script to convert it to a sql
> script to be imported into a table. That done, I could start issuing
> queries.
>
> But how to query for files to display a 'left / right view' for each
> file that is in multiple places?
>
> I mean this:
>
> This File;Also here
> C:\some.txt;C:\backup\some.txt
> C:\some.txt;C:\backup1\some.txt
> C:\some.txt;C:\backup2\some.txt
>
> but have only this list:
>
> C:\some.txt
> C:\backup\some.txt
> C:\backup1\some.txt
> C:\backup2\some.txt
>
> The reason for this is that I am faced with the problem of ECAD
> projects that are copied around many times, and I have to identify
> what files are missing here and what files are there.
>
> So a manual approach is as follows:
>
> 1) Identify one file (schematic1.sch) and see where copies of it are.
> 2) Compare the files of both directories and make a decision about
>    what files to use further.
> 3) Determine conflicts, thus these files can't be copied together
>    for a cleanup.
>
> Are there any approaches or help?

I have also been in this kind of situation before, though I always work under GNU/Linux.

1. At that time, I used `md5sum' to generate the fingerprint of all files in a given directory to be cleaned up.

2. Later, I created a simple Java program to group the names of all files that had the same fingerprint (i.e., MD5 hash).

3. I simply deleted all but one of the files with the same MD5 hash, keeping the one with a good filename (in my case, the filename couldn't be relied on for comparison, since names differed by small additions like a date, the author's name, and the like).

4. After that, I used my brain to find related files based on the filenames (e.g., `[2006-05-23] Jeff - x.txt' should be the same as `Jenny - x.txt'). Of course, the Java program also helped me in grouping the files that I thought to be related.

5. Next, I perused the related files to see whether most of the contents were the same. If yes, I took the latest one based on the modified time.
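Steps 1 and 2 above don't need a full Java program; assuming `md5sum`-style output lines of the form `<digest>  <path>`, the grouping is just a dictionary keyed on the digest. A sketch in Python (the function name is invented, for illustration):

```python
from collections import defaultdict

def group_by_fingerprint(md5sum_lines):
    """Group paths by MD5 digest, given `md5sum` output lines
    of the form '<32-hex-digest>  <path>'.

    Returns only groups with more than one path: the duplicates.
    """
    groups = defaultdict(list)
    for line in md5sum_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # md5sum separates digest and path with whitespace.
        digest, path = line.split(None, 1)
        groups[digest].append(path)
    return {d: ps for d, ps in groups.items() if len(ps) > 1}
```

Feeding it the output of `md5sum` over all files gives the same groups the Java program produced.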

> This is a very time consuming job and I am searching for any solution
> that helps me save time :-)

Well, I think I saved a lot of time: I was able to eliminate about 7,000 of 15,000 files in about two weeks.

> I know that those problems did not arise when the projects are well
> structured and in a version management system. But that isn't here :-)

I hope you employ such a system ASAP :-)


Best regards,

Eus (FSF member #4445)

In this digital era, where computing technology is pervasive, your freedom depends on the software controlling those computing devices.

Join the free software movement today! It is free as in freedom, not as in free beer!

Join: http://www.fsf.org/jf?referrer=4445

#3 Craig Ringer
craig@2ndquadrant.com
In reply to: Lothar Behrens (#1)
Re: Using database to find file doublettes in my computer

Lothar Behrens wrote:

But how to query for files to display a 'left / right view' for each
file that is on multible places ?

One approach is to use a query to extract the names of all files with
duplicates, and store the results in a TEMPORARY table with a UNIQUE
index (or PRIMARY KEY) on the filename.

You then inner-join your paths table to itself, with the join
condition being that the filename part of the path EXISTS in the table
of files having duplicates. You should also filter out rows where the
first filename is <= the second filename, to avoid outputting both (a,b)
and (b,a) for each duplicate, and to avoid outputting (a,a) and (b,b) rows.

You can do much the same thing in a single query without the temp table,
but I think you'll need a correlated subquery to check for duplicates
(or a join on a subquery that'll really expand the results to be
processed), so the temp table approach is probably going to be lots faster.

Quick example code (untested, but it should give you the idea); it
assumes you have a function get_filename(...) that extracts just the
filename part of the path:

CREATE TABLE paths (
   path text PRIMARY KEY
   -- other fields
);

-- Populate paths

-- Once paths is populated, extract the filenames that have duplicates:

SELECT get_filename(path) AS fn, count(path) AS n
INTO TEMPORARY TABLE dup_files
FROM paths
GROUP BY get_filename(path)
HAVING count(path) > 1;

-- Creates a UNIQUE index on fn as well
ALTER TABLE dup_files ADD PRIMARY KEY (fn);

-- Now build your side-by-side table of duplicates:

SELECT p1.path, p2.path
FROM paths p1 INNER JOIN paths p2
ON (get_filename(p1.path) = get_filename(p2.path))
WHERE EXISTS(SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
AND p1.path > p2.path;

You can replace get_filename(...) with appropriate code, but I'd write a
quick SQL function marked IMMUTABLE to neatly wrap up the pathname
extraction instead.

--
Craig Ringer

#4 Sam Mason
sam@samason.me.uk
In reply to: Lothar Behrens (#1)
Re: Using database to find file doublettes in my computer

On Mon, Nov 17, 2008 at 11:22:47AM -0800, Lothar Behrens wrote:

I have a problem to find as fast as possible files that are double or
in other words, identical.
Also identifying those files that are not identical.

I'd probably just take a simple Unix command line approach, something
like:

find /base/dir -type f -exec md5sum {} \; | sort | uniq -Dw 32

this will give you a list of files whose contents are identical
(according to an MD5 hash). An alternative would be to put the hashes
into a database and run the matching up there.
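The database alternative could look roughly like this: load (hash, path) rows into a table and self-join on the hash. A sketch using an in-memory SQLite table for illustration (the table and column names are invented):

```python
import sqlite3

def duplicate_pairs(hash_path_rows):
    """Load (md5, path) rows into an in-memory table and self-join
    on the hash to list each duplicate pair exactly once."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE files (md5 TEXT, path TEXT)")
    con.executemany("INSERT INTO files VALUES (?, ?)", hash_path_rows)
    # p1.path < p2.path keeps one ordering per pair and drops (a, a) rows.
    return con.execute(
        """SELECT p1.path, p2.path
           FROM files p1 JOIN files p2 ON p1.md5 = p2.md5
           WHERE p1.path < p2.path
           ORDER BY p1.path, p2.path"""
    ).fetchall()
```

The input rows could come straight from parsing the `md5sum` output above.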

Sam

#5 Gerhard Heift
ml-postgresql-20081012-3518@gheift.de
In reply to: Sam Mason (#4)
Re: Using database to find file doublettes in my computer

On Tue, Nov 18, 2008 at 12:36:42PM +0000, Sam Mason wrote:

> On Mon, Nov 17, 2008 at 11:22:47AM -0800, Lothar Behrens wrote:
> > I have a problem to find as fast as possible files that are double
> > or, in other words, identical. Also identifying those files that
> > are not identical.
>
> I'd probably just take a simple Unix command line approach, something
> like:
>
>   find /base/dir -type f -exec md5sum {} \; | sort | uniq -Dw 32

You can save a little time by using

find /base/dir -type f -print0 | xargs -0 md5sum | sort | uniq -Dw 32

> this will give you a list of files whose contents are identical
> (according to an MD5 hash). An alternative would be to put the hashes
> into a database and run the matching up there.

Gerhard

#6 Lothar Behrens
lothar.behrens@lollisoft.de
In reply to: Lothar Behrens (#1)
Re: Using database to find file doublettes in my computer

On 18 Nov., 07:40, cr...@postnewspapers.com.au (Craig Ringer) wrote:

> -- Once paths is populated, extract the filenames that have duplicates:
>
> SELECT get_filename(path) AS fn, count(path) AS n
> INTO TEMPORARY TABLE dup_files
> FROM paths
> GROUP BY get_filename(path)
> HAVING count(path) > 1;
>
> -- Creates a UNIQUE index on fn as well
> ALTER TABLE dup_files ADD PRIMARY KEY (fn);
>
> -- Now build your side-by-side table of duplicates:
>
> SELECT p1.path, p2.path
> FROM paths p1 INNER JOIN paths p2
>      ON (get_filename(p1.path) = get_filename(p2.path))
> WHERE EXISTS(SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
>   AND p1.path > p2.path;
>
> You can replace get_filename(...) with appropriate code, but I'd write
> a quick SQL function marked IMMUTABLE to neatly wrap up the pathname
> extraction instead.

Hi Craig,

I have done the steps as you described. I have about 14000 files, each
with an md5sum. Based on the full filename I have updated the md5sum in
my base table 'ECADFiles'.

With the following query I see about 2900 files that exist multiple
times:

select "Name", count("Pfad") As n
from "ECADFiles"
Group by "Name"
having count("Pfad") > 1

Using this query I see 13000 duplicated files in total:

select sum(n) from (
select "Name", count("Pfad") As n
from "ECADFiles"
Group by "Name"
having count("Pfad") > 1) as temp

Using the following query I get ~ 129000 records:

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2
  ON (p1."Name" = p2."Name")
where EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1)
  and p1."Datei" > p2."Datei"

I expected a smaller number of records: for 4 files that each exist 2
times (sum = 8) I have 8 records in ECADFiles, but should get only 4
rows in the above result.

So for an average of 2 copies per file I expected half the files from
ECADFiles, because one is exactly on the right and the other is on the
left.

In general this should give about the same number of records as the
13000 multiple files, or fewer.

Why did I get these 129000 records?

I assume the matches rotate from files on the left to the right, giving
about n - 1 times too many records. So I tested this:

select sum(n), sum(r) from (
select "Name", count("Pfad") As n, count("Pfad") * (count("Pfad") - 1) As r
from "ECADFiles"
Group by "Name"
having count("Pfad") > 1
) as temp

But I got 259240. This is probably not correct.

Testing this with one fixed file on the left to search on the right, I
get all n occurrences of the n duplicate files. Assuming there are
n - 1 too many, I got my expected result by changing the last AND
condition to the opposite:

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2
  ON (p1."Name" = p2."Name")
where p1."Datei" = E'C:\\drefsrechneralt\\Drefs511_2\\EAGLE\\MW211.SCH'
  AND EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1)
  and p1."Datei" > p2."Datei"

Gives 7 records with 7 different right files and the 8th on the left.

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum", p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2
  ON (p1."Name" = p2."Name")
where p2."Datei" = E'C:\\drefsrechneralt\\Drefs511_2\\EAGLE\\MW211.SCH'
  AND EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND "Anzahl" > 1)
  and p1."Datei" < p2."Datei"

Gives 7 records with 7 different left files and the 8th on the right.

Any ideas how to remove these unwanted records?
It doesn't look easy, because I see problems arising when changing the
data a cursor runs over.

Thanks

Lothar

#7 Craig Ringer
craig@2ndquadrant.com
In reply to: Lothar Behrens (#6)
Re: Using database to find file doublettes in my computer

Lothar Behrens wrote:

> I expected a smaller number of records: for 4 files that each exist 2
> times (sum = 8) I have 8 records in ECADFiles, but should get only 4
> rows in the above result.
>
> So for an average of 2 copies per file I expected half the files from
> ECADFiles, because one is exactly on the right and the other is on
> the left.

It's a combinatorial problem. If you have 3 duplicate files, e.g.:

INSERT INTO paths (path) VALUES
(E'C:\\path\\file1.txt'),
(E'C:\\path2\\file1.txt'),
(E'/path/file1.txt');

then the query process I described above will output the matches:

C:\path\file1.txt | C:\path2\file1.txt
/path/file1.txt | C:\path2\file1.txt
/path/file1.txt | C:\path\file1.txt

because while it avoids showing both (A,B) and (B,A) pairs, for any A, B
and C it'll show:

(A,B)
(A,C)
(B,C)
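In general, n copies of one file produce n*(n-1)/2 such rows, which is why the result set grows quadratically rather than staying near the duplicate count. A quick illustration (Python):

```python
from itertools import combinations

# The three duplicate paths from the example above:
paths = [r"C:\path\file1.txt", r"C:\path2\file1.txt", "/path/file1.txt"]

# Each unordered pair appears exactly once: 3 copies -> 3 pairs.
pairs = list(combinations(sorted(paths), 2))
assert len(pairs) == len(paths) * (len(paths) - 1) // 2

# For 8 copies of the same file, that is 8 * 7 / 2 = 28 pairs in total,
# even though fixing one copy on the left side shows only 7 matches.
n = 8
print(n * (n - 1) // 2)  # 28
```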

I've attached test SQL that does the above.

Presumably, you want to only show, say:

(A,B)
(A,C)

or maybe:

(filename, A)
(filename, B)
(filename, C)

If that's what you want, you need to work a little differently. The
attached SQL in dups_test2.sql shows one way you might do it, by
generating a list of files with duplicates then listing all the
locations each appears in. Again, you can do it without the temp table,
it'll probably just be slower. None of what I've written is particularly
fast anyway - it evaluates those regular expressions many more times
than should be required, for example.

--
Craig Ringer

Attachments:

dups_setup.sql (text/x-sql)
dups_test.sql (text/x-sql)
dups_test2.sql (text/x-sql)