Removing duplicate records from a bulk upload

Started by Daniel Begin over 11 years ago · 3 messages · general
#1Daniel Begin
jfd553@hotmail.com

I have just completed the bulk upload of a large database. Some tables have billions of records and no constraints or indexes have been applied yet. About 0.1% of these records may have been duplicated during the upload and I need to remove them before applying constraints.

I understand there are (at least) two approaches to get a table without duplicate records…

- Delete duplicate records from the table based on an appropriate select clause;

- Create a new table with the results from a select distinct clause, and then drop the original table.

What would be the most efficient procedure in PostgreSQL to do the job considering …

- I do not know which records were duplicated;

- There are no indexes applied on tables yet;

- There are no OIDs on tables yet;

- The database is currently 1TB but I have plenty of disk space.
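
In PostgreSQL terms, the second approach would be something like the following (table name is hypothetical):

```sql
-- Rebuild the table without duplicates, then swap it in
CREATE TABLE mytable_dedup AS
SELECT DISTINCT * FROM mytable;

DROP TABLE mytable;
ALTER TABLE mytable_dedup RENAME TO mytable;
```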

Daniel

#2Andy Colson
andy@squeakycode.net
In reply to: Daniel Begin (#1)
Re: Removing duplicate records from a bulk upload

On 12/7/2014 9:31 PM, Daniel Begin wrote:

I have just completed the bulk upload of a large database. Some tables
have billions of records and no constraints or indexes have been applied
yet. About 0.1% of these records may have been duplicated during the
upload and I need to remove them before applying constraints.

I understand there are (at least) two approaches to get a table without
duplicate records…

- Delete duplicate records from the table based on an
appropriate select clause;

- Create a new table with the results from a select distinct
clause, and then drop the original table.

What would be the most efficient procedure in PostgreSQL to do the job
considering …

- I do not know which records were duplicated;

- There are no indexes applied on tables yet;

- There is no OIDS on tables yet;

- The database is currently 1TB but I have plenty of disk space.

Daniel

How would you detect a duplicate? Is there a single field that would be
duplicated, or do you have to test a bunch of different fields?

If it's a single field, you could find the dups in a single pass of the table
with:

create index bigtable_key on bigtable(key);
select key, count(*) from bigtable group by key having count(*) > 1;

Save that list, and decide on some way of deleting the dups.

The index might help the initial select, but it will really help the re-query
and delete statements.
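
One sketch of the delete step, assuming ctid comparisons are available (the tables have no OIDs, but every row still has a ctid system column) and that "bigtable" and "key" stand in for the real names: for each duplicated key, keep the physically first row and delete the rest.

```sql
-- Keep the row with the lowest ctid per key; delete the others.
-- "bigtable" and "key" are hypothetical names.
DELETE FROM bigtable a
USING bigtable b
WHERE a.key = b.key
  AND a.ctid > b.ctid;
```

On billions of rows this self-join is expensive, which is where the index on key earns its keep.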

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Andy Colson
andy@squeakycode.net
In reply to: Andy Colson (#2)
Re: Removing duplicate records from a bulk upload

On 12/8/2014 10:30 AM, Andy Colson wrote:


How would you detect duplicate? Is there a single field that would be
duplicated? Or do you have to test a bunch of different fields?

If its a single field, you could find dups in a single pass of the table
with:

create index bigtable_key on bigtable(key);
select key, count(*) from bigtable group by key having count(*) > 1;

Save that list, and decide on some way of deleting the dups.

The index might help the initial select, but will really help re-query
and delete statements.

-Andy

I just thought of a more generic way.

1) make a non-unique index on bigtable
2) make a temp table
3) -- copy only the dups
insert into temp
select * from bigtable where (it is a duplicate);

4)
delete from bigtable where key in (select key from temp);

5)
insert into bigtable
select distinct * from temp;

This would minimize the amount of data you have to move around. It depends
on how hard step 3 is to write. The index isn't required, but it would make
both steps 3 and 4 faster.
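
A concrete sketch of those five steps, assuming a single hypothetical column key is enough to identify a duplicate:

```sql
-- 1) non-unique index to speed up the duplicate checks
CREATE INDEX bigtable_key ON bigtable (key);

-- 2) + 3) temp table holding only the duplicated rows
CREATE TEMP TABLE dups AS
SELECT *
FROM bigtable
WHERE key IN (
    SELECT key
    FROM bigtable
    GROUP BY key
    HAVING count(*) > 1
);

-- 4) remove every row that belongs to a duplicated key
DELETE FROM bigtable
WHERE key IN (SELECT key FROM dups);

-- 5) put back exactly one copy of each
INSERT INTO bigtable
SELECT DISTINCT * FROM dups;
```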

-Andy
