remote duplicate rows

Started by Junkoneover 19 years ago9 messagesgeneral

junkone1@gmail.com

over 19 years ago

hI
i have a bad situation that i did not have primary key. so i have a
table like this
colname1 colname2
1 apple
1 apple
2 orange
2 orange

It is a very large table. how do i remove the duplctes quickly annd
without much change.

Regards

Seede

ljb

ljb220@mindspring.com

over 19 years ago

In reply to: Junkone (#1)

Re: remote duplicate rows

junkone1@gmail.com wrote:

hI
i have a bad situation that i did not have primary key. so i have a
table like this
colname1 colname2
1 apple
1 apple
2 orange
2 orange

It is a very large table. how do i remove the duplctes quickly annd
without much change.

Make a new table (with a primary key) and the same columns in order,
and do: INSERT INTO newtable SELECT DISTINCT * FROM oldtable;

Ron Johnson

ron.l.johnson@cox.net

over 19 years ago

In reply to: ljb (#2)

Re: remote duplicate rows

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 09/13/06 19:36, ljb wrote:

junkone1@gmail.com wrote:

hI
i have a bad situation that i did not have primary key. so i have a
table like this
colname1 colname2
1 apple
1 apple
2 orange
2 orange

It is a very large table. how do i remove the duplctes quickly annd
without much change.

Since the 2 colname1 == 2 records are different (extra spaces in
colname2), how do you determine which is the correct record? (Or is
the extra space just an artifact?)

Make a new table (with a primary key) and the same columns in order,
and do: INSERT INTO newtable SELECT DISTINCT * FROM oldtable;

That's a single transaction, and since this is a "very large table",
it would be very unpleasant if it rolled back at 95%.

Of course, we don't know what junkone1's definition of "very large"
is and how beefy his hardware is...

If there are OIDs on the table, you could write a script with this
pseudocode, which because of the "candidate key table", transaction
block and LIMIT TO, allows the script to be restated. Niceties like
printing timestamp and a counter after every commit are always helpful.

CREATE TABLE BIGTABLE_PK (
COLNAME1 INTEGER);

INSERT INTO BIGTABLE_PK
SELECT DISTINCT COLNAME1
FROM BIGTABLE;

CREATE INDEX I_BIGTABLE_PK
ON BIGTABLE_PK (COLNAME1)
TYPE IS SORTED;

DECLARE LOOP_FLAG INTEGER = 1;
WHILE LOOP_FLAG DO
BEGIN TRANSACTION
FOR :X AS EACH ROW OF
SELECT COLNAME1
FROM BIGTABLE_PK
ORDER BY COLNAME1
LIMIT TO 2000 ROWS
DO
DELETE FROM BIGTABLE
WHERE OID IN (SELECT OID
FROM BIGTABLE_PK
WHERE COLNAME1 = :X.COLNAME1
LIMIT TO 1 ROWS);
DELETE FROM BIGTABLE_PK
WHERE COLNAME1 = :X.COLNAME1;
END FOR;
IF (SELECT COUNT(*) FROM BIGTABLE_PK) == 0 THEN
SET LOOP_FLAG = 0;
END IF;
COMMIT;
END ;

- --
Ron Johnson, Jr.
Jefferson LA USA

Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFCMPtS9HxQb37XmcRArYMAKCGEK7ft1PLprnHtpjsLYlgs4t5gACdEksT
JY42ieEmRvehOsuU/o6YFR8=
=MJhV
-----END PGP SIGNATURE-----

A. Kretschmer

andreas.kretschmer@schollglas.com

over 19 years ago

In reply to: Junkone (#1)

Re: remote duplicate rows

am Wed, dem 13.09.2006, um 15:46:58 -0700 mailte Junkone folgendes:

hI
i have a bad situation that i did not have primary key. so i have a
table like this
colname1 colname2
1 apple
1 apple
2 orange
2 orange

It is a very large table. how do i remove the duplctes quickly annd
without much change.

begin;
alter table foo rename to tmp;
create table foo as select distinct * from tmp;
commit;

You should create a primary key now to avoid duplicated entries...

HTH, Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47215, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net

John Sidney-Woollett

johnsw@wardbrook.com

over 19 years ago

In reply to: A. Kretschmer (#4)

Re: remote duplicate rows

If you have a primary key value (or OID?) then you can delete the
duplicates in situ using something like (untested)

-- should work if never more than 1 duplicate row for colname1, colname2

delete from table where pk_value in (
select min(pk_value)
from table
group by colname1, colname2
having count(*) > 1
)

-- if you can have multiple duplicate rows for colname1, colname2
-- then you need something like

delete from table where pk_value not in (
select min(pk_value)
from table
group by colname1, colname2
having count(*) = 1
)

Hope that helps.

John

A. Kretschmer wrote:

Show quoted text

am Wed, dem 13.09.2006, um 15:46:58 -0700 mailte Junkone folgendes:

hI
i have a bad situation that i did not have primary key. so i have a
table like this
colname1 colname2
1 apple
1 apple
2 orange
2 orange

It is a very large table. how do i remove the duplctes quickly annd
without much change.

begin;
alter table foo rename to tmp;
create table foo as select distinct * from tmp;
commit;

You should create a primary key now to avoid duplicated entries...

HTH, Andreas

Andrews, Chris

Chris.Andrews@Lorien.co.uk

over 19 years ago

In reply to: John Sidney-Woollett (#5)

Re: remote duplicate rows

Dunno about quickly, but I usually do something like this (before slapping myself in the face for getting into that state):

CREATE TABLE tn_backup AS SELECT DISTINCT * FROM tn;
TRUNCATE TABLE tn;
INSERT INTO tn VALUES SELECT * from tn_backup;

(Where "tn" is the table name)

May not be the best way, but keeps indexes and stuff on the original table if you don't want to set them all up again. Me lazy?

That said, if you've got foriegn keys pointing at it, the truncate ain't going to work.

Or if you have your data exported as a tab or csv, the use sort | uniq on it and
shove it back in...

-----Original Message-----
From: Junkone [mailto:junkone1@gmail.com]
Sent: 13 September 2006 23:47
To: pgsql-general@postgresql.org
Subject: [GENERAL] remote duplicate rows

hI
i have a bad situation that i did not have primary key. so i have a
table like this
colname1 colname2
1 apple
1 apple
2 orange
2 orange

It is a very large table. how do i remove the duplctes quickly annd
without much change.

Regards

Seede

-----------------------------------------
The information contained in this email is confidential and is
intended for the recipient only. If you have received it in error,
please notify us immediately by reply email and then delete it from
your system. Please do not copy it or use it for any purposes, or
disclose its contents to any other person or store or copy this
information in any medium. The views contained in this email are
those of the author and not necessarily those of Lorien plc.

Thank you for your co-operation.

Import Notes

Resolved by subject fallback

Berend Tober

btober@seaworthysys.com

over 19 years ago

In reply to: A. Kretschmer (#4)

Re: remote duplicate rows

A. Kretschmer wrote:

am Wed, dem 13.09.2006, um 15:46:58 -0700 mailte Junkone folgendes:

hI
i have a bad situation that i did not have primary key. so i have a
table like this
colname1 colname2
1 apple
1 apple
2 orange
2 orange

It is a very large table. how do i remove the duplctes quickly annd
without much change.

begin;
alter table foo rename to tmp;
create table foo as select distinct * from tmp;
commit;

A couple potential problems here. First, you forgot to drop table tmp. But maybe that is good thing because although the OP hasn't told us anything else useful about the situation, and he has clearly contrived a simplistic facsimile of his real problem, to be useful the table most likely either has foreign key references, and/or is the primary key for other table foreign keys. You're suggestion will break whatever application this data base supports because all the foreign keys will point to table tmp rather than foo afterwards. Similarly, there is the problem of any indexes on the table that would be lost. But I suppose one can make the point that your suggestion is a great solution, given the contrived example and insufficient problem understanding presented by the OP -- I really think he needs more help than he realizes.

Regards,
Berend Tober

Junkone

junkone1@gmail.com

over 19 years ago

In reply to: Andrews, Chris (#6)

Re: remote duplicate rows

Thanks for all of your help. I backed up the table and used the PgAdmin
tool to create Insert statements. It did it in two sets. I reran the
first set and it solved the problem.

Seede

"Andrews, Chris" wrote:

Show quoted text

Dunno about quickly, but I usually do something like this (before slapping myself in the face for getting into that state):

CREATE TABLE tn_backup AS SELECT DISTINCT * FROM tn;
TRUNCATE TABLE tn;
INSERT INTO tn VALUES SELECT * from tn_backup;

(Where "tn" is the table name)

May not be the best way, but keeps indexes and stuff on the original table if you don't want to set them all up again. Me lazy?

That said, if you've got foriegn keys pointing at it, the truncate ain't going to work.

Or if you have your data exported as a tab or csv, the use sort | uniq on it and
shove it back in...

-----Original Message-----
From: Junkone [mailto:junkone1@gmail.com]
Sent: 13 September 2006 23:47
To: pgsql-general@postgresql.org
Subject: [GENERAL] remote duplicate rows

hI
i have a bad situation that i did not have primary key. so i have a
table like this
colname1 colname2
1 apple
1 apple
2 orange
2 orange

It is a very large table. how do i remove the duplctes quickly annd
without much change.

Regards

Seede

-----------------------------------------
The information contained in this email is confidential and is
intended for the recipient only. If you have received it in error,
please notify us immediately by reply email and then delete it from
your system. Please do not copy it or use it for any purposes, or
disclose its contents to any other person or store or copy this
information in any medium. The views contained in this email are
those of the author and not necessarily those of Lorien plc.

Thank you for your co-operation.

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Sim Zacks

sim@compulab.co.il

over 19 years ago

In reply to: Berend Tober (#7)

Re: remote duplicate rows

You forgot to mention that all the functions/views that utilized that table also now point to the
original table with the changed name, because it doesn't store the table name, it stores the table oid.

Berend Tober wrote:

Show quoted text

A. Kretschmer wrote:

am Wed, dem 13.09.2006, um 15:46:58 -0700 mailte Junkone folgendes:

hI
i have a bad situation that i did not have primary key. so i have a
table like this
colname1 colname2
1 apple
1 apple
2 orange
2 orange

It is a very large table. how do i remove the duplctes quickly annd
without much change.

begin;
alter table foo rename to tmp;
create table foo as select distinct * from tmp;
commit;

A couple potential problems here. First, you forgot to drop table tmp.
But maybe that is good thing because although the OP hasn't told us
anything else useful about the situation, and he has clearly contrived a
simplistic facsimile of his real problem, to be useful the table most
likely either has foreign key references, and/or is the primary key for
other table foreign keys. You're suggestion will break whatever
application this data base supports because all the foreign keys will
point to table tmp rather than foo afterwards. Similarly, there is the
problem of any indexes on the table that would be lost. But I suppose
one can make the point that your suggestion is a great solution, given
the contrived example and insufficient problem understanding presented
by the OP -- I really think he needs more help than he realizes.

Regards,
Berend Tober

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend