Removing duplicate rows in table

Started by Rich Shepardover 1 year ago16 messagesgeneral
Jump to latest
#1Rich Shepard
rshepard@appl-ecosys.com

I've no idea how I entered multiple, identical rows in a table but I want to
delete all but one of these rows.

Here's an example:

bustrac=# select * from projects where proj_nbr = '4242.01';
proj_nbr | proj_name | start_date | end_date | description | notes
----------+----------------+------------+------------+---------------+-------
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
(4 rows)

How do I clean this up so there's only a single row for this project number?

TIA,

Rich

#2Christophe Pettus
xof@thebuild.com
In reply to: Rich Shepard (#1)
Re: Removing duplicate rows in table

If you don't mind taking the time to swap tables, you can always do an INSERT ... SELECT DISTINCT <fields> into a new table, and then swap it with the existing table.

Show quoted text

On Sep 10, 2024, at 08:07, Rich Shepard <rshepard@appl-ecosys.com> wrote:

I've no idea how I entered multiple, identical rows in a table but I want to
delete all but one of these rows.

Here's an example:

bustrac=# select * from projects where proj_nbr = '4242.01';
proj_nbr | proj_name | start_date | end_date | description | notes ----------+----------------+------------+------------+---------------+-------
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy | (4 rows)

How do I clean this up so there's only a single row for this project number?

TIA,

Rich

#3Ron
ronljohnsonjr@gmail.com
In reply to: Rich Shepard (#1)
Re: Removing duplicate rows in table

On Tue, Sep 10, 2024 at 11:07 AM Rich Shepard <rshepard@appl-ecosys.com>
wrote:

I've no idea how I entered multiple, identical rows in a table but I want
to
delete all but one of these rows.

Here's an example:

bustrac=# select * from projects where proj_nbr = '4242.01';
proj_nbr | proj_name | start_date | end_date | description |
notes

----------+----------------+------------+------------+---------------+-------
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
(4 rows)

How do I clean this up so there's only a single row for this project
number?

https://www.postgresqltutorial.com/postgresql-window-function/postgresql-row_number/

I'd use row_number to delete records where row_number > 1.

--
Death to America, and butter sauce.
Iraq lobster!

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Rich Shepard (#1)
Re: Removing duplicate rows in table

On 9/10/24 08:07, Rich Shepard wrote:

I've no idea how I entered multiple, identical rows in a table but I
want to
delete all but one of these rows.

Is there a Primary Key or Unique index on this table?

Here's an example:

bustrac=# select * from projects where proj_nbr = '4242.01';
 proj_nbr |   proj_name    | start_date |  end_date  |  description  |
notes
----------+----------------+------------+------------+---------------+-------
 4242.01  | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
 4242.01  | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
 4242.01  | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
 4242.01  | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
(4 rows)

How do I clean this up so there's only a single row for this project
number?

TIA,

Rich

--
Adrian Klaver
adrian.klaver@aklaver.com

#5Andreas Kretschmer
andreas@a-kretschmer.de
In reply to: Rich Shepard (#1)
Re: Removing duplicate rows in table

Am 10.09.24 um 17:07 schrieb Rich Shepard:

I've no idea how I entered multiple, identical rows in a table but I
want to
delete all but one of these rows.

Here's an example:

bustrac=# select * from projects where proj_nbr = '4242.01';
 proj_nbr |   proj_name    | start_date |  end_date  | description  |
notes
----------+----------------+------------+------------+---------------+-------
 4242.01  | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
 4242.01  | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
 4242.01  | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
 4242.01  | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
(4 rows)

How do I clean this up so there's only a single row for this project
number?

TIA,

Rich

you can use the hidden ctid-column:

postgres=# create table demo (id int, val text);
CREATE TABLE
postgres=# insert into demo values (1, 'test1');
INSERT 0 1
postgres=# insert into demo values (1, 'test1');
INSERT 0 1
postgres=# insert into demo values (1, 'test1');
INSERT 0 1
postgres=# insert into demo values (1, 'test1');
INSERT 0 1
postgres=# select ctid, * from demo;
 ctid  | id |  val
-------+----+-------
 (0,1) |  1 | test1
 (0,2) |  1 | test1
 (0,3) |  1 | test1
 (0,4) |  1 | test1
(4 rows)

postgres=# with my_ctid as (select min(ctid) from demo where id = 1 and
val = 'test1') delete from demo using my_ctid where id=1 and val='test1'
and ctid != my_ctid.min;
DELETE 3
postgres=# select ctid, * from demo;
 ctid  | id |  val
-------+----+-------
 (0,1) |  1 | test1
(1 row)

postgres=#

--
Andreas Kretschmer
CYBERTEC PostgreSQL Services and Support

#6Rich Shepard
rshepard@appl-ecosys.com
In reply to: Adrian Klaver (#4)
Re: Removing duplicate rows in table

On Tue, 10 Sep 2024, Adrian Klaver wrote:

Is there a Primary Key or Unique index on this table?

Adrian,

No. It didn't occur to me to make the project number a PK as this table is
not related to others in the database.

But, yesterday it occurred to me to make the proj_nbr a PK to eliminate
future issues.

Thanks,

Rich

#7Rich Shepard
rshepard@appl-ecosys.com
In reply to: Rich Shepard (#1)
Re: Removing duplicate rows in table

On Tue, 10 Sep 2024, Francisco Olarte wrote:

Do you have any kid of corruption (i.e, unique index violation) or is
it just a duplicate problem?

Francisco,

Only a duplicate problem because when I created this table I didn't make the
proj_nbr column a PK.

Also, if you do not have any uniqueness criteria consider adding an
"id identity" column, it is useful when shit hits the fan.

Yep. that's what I will do.

Thanks,

Rich

#8Rich Shepard
rshepard@appl-ecosys.com
In reply to: Christophe Pettus (#2)
Re: Removing duplicate rows in table

On Tue, 10 Sep 2024, Christophe Pettus wrote:

If you don't mind taking the time to swap tables, you can always do an
INSERT ... SELECT DISTINCT <fields> into a new table, and then swap it
with the existing table.

Christophe,

I'll make the proj_nbr table the PK then do as you recommend.

Thank you,

Rich

#9Rich Shepard
rshepard@appl-ecosys.com
In reply to: Rich Shepard (#1)
Re: Removing duplicate rows in table

On Tue, 10 Sep 2024, Andreas Kretschmer wrote:

you can use the hidden ctid-column:

postgres=# create table demo (id int, val text);
CREATE TABLE
postgres=# insert into demo values (1, 'test1');
INSERT 0 1
postgres=# insert into demo values (1, 'test1');
INSERT 0 1
postgres=# insert into demo values (1, 'test1');
INSERT 0 1
postgres=# insert into demo values (1, 'test1');
INSERT 0 1
postgres=# select ctid, * from demo;
ᅵctidᅵ | id |ᅵ val
-------+----+-------
ᅵ(0,1) |ᅵ 1 | test1
ᅵ(0,2) |ᅵ 1 | test1
ᅵ(0,3) |ᅵ 1 | test1
ᅵ(0,4) |ᅵ 1 | test1
(4 rows)

postgres=# with my_ctid as (select min(ctid) from demo where id = 1 and val =
'test1') delete from demo using my_ctid where id=1 and val='test1' and ctid
!= my_ctid.min;
DELETE 3
postgres=# select ctid, * from demo;
ᅵctidᅵ | id |ᅵ val
-------+----+-------
ᅵ(0,1) |ᅵ 1 | test1
(1 row)

Thanks, Andreas.

Rich

#10Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Rich Shepard (#6)
Re: Removing duplicate rows in table

On 9/10/24 08:29, Rich Shepard wrote:

On Tue, 10 Sep 2024, Adrian Klaver wrote:

Is there a Primary Key or Unique index on this table?

Adrian,

No. It didn't occur to me to make the project number a PK as this table is
not related to others in the database.

But, yesterday it occurred to me to make the proj_nbr a PK to eliminate
future issues.

You might want to do something like:

select proj_nbr, count(*) as ct from projects group by proj_nbr;

to see how big a problem it is. If it is only a few projects it could
just a matter of manually deleting the extras.

Whatever you do:

1) Make sure you have a backup of at least that table.

2) Do the data changes as BEGIN; <changes> COMMIT; or ROLLBACK;

Thanks,

Rich

--
Adrian Klaver
adrian.klaver@aklaver.com

#11Francisco Olarte
folarte@peoplecall.com
In reply to: Rich Shepard (#7)
Re: Removing duplicate rows in table

Rich:

On Tue, 10 Sept 2024 at 17:32, Rich Shepard <rshepard@appl-ecosys.com> wrote:

Only a duplicate problem because when I created this table I didn't make the
proj_nbr column a PK.

Always report this is if you have future problems, so people know if
it is a case of pilot error or corruption, solutions differ and the
good for ones may harm others.

Also, if you do not have any uniqueness criteria consider adding an
"id identity" column, it is useful when shit hits the fan.

Yep. that's what I will do.

Bear in mind the relational model on which relational databases are
modeled needs unique rows ( i.e., no two full rows should be equal on
all fields ). It is not enforced in SQL, but now having it normally
leads to problems. When it cannot be done in any other way, normally
adding an identity column is a cheap way to make them unique. I had
that problem with CDR tables ( it means call detail record, and until
I managed to add circuit identification I had the problem that you can
have two calls from and two the same two numbers with equal start and
end times ) and solved it using an identity column ( added just for
this purpose, after doing it a couple times using ctid in testing, it
is slightly more expensive, but a lot more civilized ).

Francisco Olarte.

#12Rich Shepard
rshepard@appl-ecosys.com
In reply to: Adrian Klaver (#10)
Re: Removing duplicate rows in table

On Tue, 10 Sep 2024, Adrian Klaver wrote:

You might want to do something like:

select proj_nbr, count(*) as ct from projects group by proj_nbr;

to see how big a problem it is. If it is only a few projects it could just a
matter of manually deleting the extras.

Adrian,

It's a small table, not updated in a while. Looking at the example I sent
how do I delete the extras while keeping one when each row has the same
content? Not knowing how to do that is why I wrote.

Whatever you do:
1) Make sure you have a backup of at least that table.
2) Do the data changes as BEGIN; <changes> COMMIT; or ROLLBACK;

Yep. Learned that lesson.

Thanks,

Rich

P.S. Please reply to the mail list so I receive only one copy of your
message, not two.

#13Rich Shepard
rshepard@appl-ecosys.com
In reply to: Rich Shepard (#12)
Re: Removing duplicate rows in table

On Tue, 10 Sep 2024, Rich Shepard wrote:

to see how big a problem it is. If it is only a few projects it could just
a matter of manually deleting the extras.

Not knowing how to do that is why I wrote.

A web search (which I should have done before posting this thread) shows me
how to do this:
<https://www.postgresqltutorial.com/postgresql-tutorial/how-to-delete-duplicate-rows-in-postgresql/&gt;

Rich

#14Erik Wienhold
ewie@ewie.name
In reply to: Rich Shepard (#12)
Re: Removing duplicate rows in table

On 2024-09-10 18:38 +0200, Rich Shepard wrote:

P.S. Please reply to the mail list so I receive only one copy of your
message, not two.

You can configure your list subscription to not receive an extra copy.
That setting is under "Global configuration" on
https://lists.postgresql.org/manage/.

--
Erik

#15Muhammad Usman Khan
usman.k@bitnine.net
In reply to: Rich Shepard (#1)
Re: Removing duplicate rows in table

Hi,
You can try the following CTE which removes all the identical rows and only
leave single row

WITH CTE AS (
SELECT ctid, ROW_NUMBER() OVER (PARTITION BY proj_nbr, proj_name,
start_date, end_date, description, notes ORDER BY proj_nbr) AS rn
FROM projects
WHERE proj_nbr = '4242.02'
)
DELETE FROM projects
WHERE ctid IN (
SELECT ctid FROM CTE WHERE rn > 1
);

On Tue, 10 Sept 2024 at 20:07, Rich Shepard <rshepard@appl-ecosys.com>
wrote:

Show quoted text

I've no idea how I entered multiple, identical rows in a table but I want
to
delete all but one of these rows.

Here's an example:

bustrac=# select * from projects where proj_nbr = '4242.01';
proj_nbr | proj_name | start_date | end_date | description |
notes

----------+----------------+------------+------------+---------------+-------
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
4242.01 | Expert witness | 2008-10-15 | 2008-10-28 | Consol Energy |
(4 rows)

How do I clean this up so there's only a single row for this project
number?

TIA,

Rich

#16Rich Shepard
rshepard@appl-ecosys.com
In reply to: Muhammad Usman Khan (#15)
Re: Removing duplicate rows in table

On Thu, 12 Sep 2024, Muhammad Usman Khan wrote:

You can try the following CTE which removes all the identical rows and only
leave single row

Thank you, Muhammed.

Rich