Duplicate Row Removal

Started by Peter Atkinsover 20 years ago5 messagesgeneral

patkins@directpartners.com

over 20 years ago

All,

I have a duplicate row problem and to make matters worse some tables
don't have a PK or any unique identifier. See below:

uid | cleanval | timestamp | received
-----+-------+------------+----------
38 | 5 | 1125081799 | 1
38 | 14 | 1122683252 | 0
38 | 5 | 1125081799 | 1
38 | 14 | 1122683252 | 0

I was hoping to have a system oid for each row but it looks like that's
not the case. Anyone have any thoughts on how to remove dups? I have
about 40 tables of various sizes.

Cheers,
Peter

Tom Lane

tgl@sss.pgh.pa.us

over 20 years ago

In reply to: Peter Atkins (#1)

Re: Duplicate Row Removal

"Peter Atkins" <patkins@directpartners.com> writes:

I was hoping to have a system oid for each row but it looks like that's
not the case. Anyone have any thoughts on how to remove dups?

ctid always works ...

regards, tom lane

Dean Gibson (DB Administrator)

postgresql4@ultimeth.com

over 20 years ago

In reply to: Peter Atkins (#1)

Re: Duplicate Row Removal

CREATE TABLE new_name AS SELECT DISTINCT * FROM old_name;

DROP TABLE old_name;

ALTER TABLE new_name RENAME TO old_name;

Show quoted text

On 2005-11-04 17:15, Peter Atkins wrote:

All,

I have a duplicate row problem and to make matters worse some tables don't have a PK or any unique identifier.

Anyone have any thoughts on how to remove dups?

Berend Tober

btober@seaworthysys.com

over 20 years ago

In reply to: Dean Gibson (DB Administrator) (#3)

Re: Duplicate Row Removal

Dean Gibson (DB Administrator) wrote:

CREATE TABLE new_name AS SELECT DISTINCT * FROM old_name;

DROP TABLE old_name;

ALTER TABLE new_name RENAME TO old_name;

The problem with this technique is that it doesn't account for indexes,
foreign key references, and other dependencies.

Another approach is to temporarily add an integer column, populate it
with sequential values, and then use that new column to uniquely
identify the rows that are otherwise duplicates. Then you can use
aggregation to identify and delete the rows you don't need, followed by
dropping the temporary extra column. HTH.

-- BMT

Show quoted text

On 2005-11-04 17:15, Peter Atkins wrote:

All,

I have a duplicate row problem and to make matters worse some tables
don't have a PK or any unique identifier.
Anyone have any thoughts on how to remove dups?

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Florian Pflug

fgp@phlo.org

over 20 years ago

In reply to: Berend Tober (#4)

Re: [GENERAL] Duplicate Row Removal

Berend Tober wrote:

Dean Gibson (DB Administrator) wrote:

CREATE TABLE new_name AS SELECT DISTINCT * FROM old_name;

DROP TABLE old_name;

ALTER TABLE new_name RENAME TO old_name;

The problem with this technique is that it doesn't account for indexes,
foreign key references, and other dependencies.

Another approach is to temporarily add an integer column, populate it
with sequential values, and then use that new column to uniquely
identify the rows that are otherwise duplicates. Then you can use
aggregation to identify and delete the rows you don't need, followed by
dropping the temporary extra column. HTH.

Or, you use the column ctid, which exists in every table (I think it
tells you where to find the tuple in the datafile) and is guaranteed to
be unique (because two tuples cant reference the same place in the
datafile).

So, try

delete from table where exists (select 1 from table t1 where
t1.f1 = table.f1 and
t1.f2 = table.f2 and
...
...
t1.fn = table.fn and
textin(tidout(t1.ctid)) < textin(tidout(table.ctid))
) ;

The textin(varcharout(x)) trick is necessary, because you need to
choose _one_ row of each group of duplicates that shall survive.
Since the type of ctid (which is tid) doesn't support smaller-than
comparisons, we convert it to some varchar, and compare that. It's
not important how the converted varchar looks, and which ctid turns
out to be the smallest - all that matters it that there _will_ _be_
exactly one smallest ctid for each group of duplicated, and that
tuple will survive, because the select would find any row in that case.

greetings, Florian Pflug

PS: I believe this should be added to some FAQ - It's a neat trick, but
difficult to figure out, and the question seems to come up quite often.