How to idenity duplicate rows

Started by David Inglisabout 20 years ago4 messagesgeneral
Jump to latest
#1David Inglis
david.inglis@ideloveto.com.au

Can anybody assist with this problem I have a table that has some duplicated
rows of data, I want to place a unique constraint on the columns userid
and procno to eliminate this problem in the future but how do I identify and
get rid of the existing duplication.

Thanks in advance

Regards

David Inglis

#2Peter Eisentraut
peter_e@gmx.net
In reply to: David Inglis (#1)
Re: How to idenity duplicate rows

David Inglis wrote:

Can anybody assist with this problem I have a table that has some
duplicated rows of data, I want to place a unique constraint on the
columns userid and procno to eliminate this problem in the future but
how do I identify and get rid of the existing duplication.

To find them, something like

SELECT a, b, c FROM table GROUP BY a, b, c HAVING count(*) > 1;

comes to mind, where you have to list all columns of the table in place
of a, b, c.

As for deleting all but one row in a duplicated group, you're going to
have to get at them by the oid or ctid columns perhaps.

The other idea is to run CREATE TABLE newtable AS SELECT DISTINCT * FROM
oldtable;.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#3Berend Tober
btober@seaworthysys.com
In reply to: Peter Eisentraut (#2)
Re: How to idenity duplicate rows

Peter Eisentraut wrote:

David Inglis wrote:

Can anybody assist with this problem I have a table that has some
duplicated rows of data, I want to place a unique constraint on the
columns userid and procno to eliminate this problem in the future but
how do I identify and get rid of the existing duplication.

To find them, something like

SELECT a, b, c FROM table GROUP BY a, b, c HAVING count(*) > 1;

comes to mind, where you have to list all columns of the table in place
of a, b, c.

As for deleting all but one row in a duplicated group, you're going to
have to get at them by the oid or ctid columns perhaps.

The other idea is to run CREATE TABLE newtable AS SELECT DISTINCT * FROM
oldtable;.

This doesn't bring over to the new table any foreign key relationships
or triggers.

Another approach (if you don't have OID's) is to create uniqueness by
appending a column to the table, populating it with sequential integers.
Then you proceed as otherwise suggested above by using aggregation to
identify the duplicated rows.

#4Jan Cruz
malebug@gmail.com
In reply to: Peter Eisentraut (#2)
Re: How to idenity duplicate rows

As for deleting all but one row in a duplicated group, you're going to
have to get at them by the oid or ctid columns perhaps.

The other idea is to run CREATE TABLE newtable AS SELECT DISTINCT * FROM
oldtable;.

I believe getting oid and/or ctid is not possible since it would not
display/get duplicate records
in a "HAVING CLAUSE" since their oid/ctid wouldn't be the same.

And creating a newtable and use SELECT DISTINCT isn't an acceptable idea to
the audit team when you have millions of records in a table in production
server when you only need to remove let say 70 records from that table.

I tried another approach where I queried and insert the duplicate record (35
records) into
a new/temporary table. Then I created a stored procedure something like
this:
--START

DECLARE
foo table;

BEGIN
FOR foo IN
SELECT * FROM newtable
LOOP
DELETE FROM oldtable
where oldtable.field1 = foo.field1
and oldtable.field2 = foo.field2
and oldtable.field3 = foo.field3
and oldtable.field4 = foo.field4
...';
END LOOP;
END;

-- END

Problem with this approach I got 35 duplicate records (count = 2)
from the new table and delete only 20 records from the oldtable.
If I am not mistaken it should delete 70 records.

I wonder