Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Thanks for your answers Andy; I will keep in mind the procedure you proposed.
About the fields required to find duplicate records, all of them are required (5-9 fields, depending on the table).
Considering that the tables are not indexed yet, am I right to think that both approaches will need a full table scan?
- Deleting duplicate records would need a full table scan to create temporary indexes to select/remove duplicate records;
- Inserting distinct records into an empty table will also need a full table scan to select distinct (*) from big_table;
That said, is the indexing and selection/deletion of duplicate records faster than rewriting a whole table from distinct records? I am trying to find a rationale behind the choice - execution time, simplicity? I just don't know what the criteria should be and how/why they would affect the selection of an approach :-|
Daniel
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Andy Colson
Sent: December-08-14 11:39
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
On 12/8/2014 10:30 AM, Andy Colson wrote:
On 12/7/2014 9:31 PM, Daniel Begin wrote:
I have just completed the bulk upload of a large database. Some tables have billions of records and no constraints or indexes have been applied yet. About 0.1% of these records may have been duplicated during the upload and I need to remove them before applying constraints.

I understand there are (at least) two approaches to get a table without duplicate records…
- Delete duplicate records from the table based on an appropriate select clause;
- Create a new table with the results from a select distinct clause, and then drop the original table.

What would be the most efficient procedure in PostgreSQL to do the job considering…
- I do not know which records were duplicated;
- There are no indexes applied on tables yet;
- There are no OIDs on tables yet;
- The database is currently 1TB but I have plenty of disk space.

Daniel
How would you detect duplicates? Is there a single field that would be duplicated? Or do you have to test a bunch of different fields?

If it's a single field, you could find dups in a single pass of the table with:

create index bigtable_key on bigtable(key);
select key, count(*) from bigtable group by key having count(*) > 1;

Save that list, and decide on some way of deleting the dups.
The index might help the initial select, but will really help the re-query and delete statements.

-Andy
I just thought of a more generic way.
1) make a non-unique index on bigtable
2) make a temp table
3) -- copy only dups
   insert into temp
   select * from bigtable where (it's a duplicate);
4) delete from bigtable where key in (select key from temp);
5) insert into bigtable
   select distinct * from temp;
This would minimize the amount of data you have to move around. It depends on how hard step 3 is to write. An index is not required, but it would make both steps 3 and 4 faster.
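As a hedged illustration only (the steps above use placeholders; assume hypothetical columns a, b, c form the duplicate key):

create index bigtable_idx on bigtable (a, b, c);     -- step 1: non-unique index

create temp table temp as                            -- steps 2-3: copy only the duplicated rows
select * from bigtable
where (a, b, c) in (select a, b, c from bigtable
                    group by a, b, c having count(*) > 1);

delete from bigtable                                 -- step 4: remove every copy of those keys
where (a, b, c) in (select a, b, c from temp);

insert into bigtable                                 -- step 5: put back exactly one copy of each
select distinct * from temp;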
-Andy
On 12/8/2014 1:59 PM, Daniel Begin wrote:
Thanks for your answers Andy; I will keep in mind the procedure you proposed.
About the fields required to find duplicate records, all of them are required (5-9) depending on the table.
These tables have no field that would normally be designated 'primary key'?
--
john r pierce 37N 122W
somewhere on the middle of the left coast
On 12/08/2014 03:59 PM, Daniel Begin wrote:
Thanks for your answers Andy; I will keep in mind the procedure you proposed.
About the fields required to find duplicate records, all of them are required (5-9 fields, depending on the table).
I was assuming you could find dups using a single field. That field could be indexed, and (via the index) used to find dups as well as delete rows.
If you have to use all the columns ... well, I'd ignore everything I said. :-)
-Andy
If you're de-duping a whole table, no need to create indexes, as it's
gonna have to hit every row anyway. Fastest way I've found has been:
select a,b,c into newtable from oldtable group by a,b,c;
One pass, done.
If you want to use less than the whole row, you can use select
distinct on (col1, col2) * into newtable from oldtable;
Scott Marlowe <scott.marlowe@gmail.com> writes:
If you're de-duping a whole table, no need to create indexes, as it's
gonna have to hit every row anyway. Fastest way I've found has been:
select a,b,c into newtable from oldtable group by a,b,c;
One pass, done.
If you want to use less than the whole row, you can use select
distinct on (col1, col2) * into newtable from oldtable;
Also, the DISTINCT ON method can be refined to control which of a set of
duplicate keys is retained, if you can identify additional columns that
constitute a preference order for retaining/discarding dupes. See the
"latest weather reports" example in the SELECT reference page.
In any case, it's advisable to crank up work_mem while performing this
operation.
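For instance, as a session-local setting (the value here is only illustrative):

set work_mem = '1GB';   -- revert with: reset work_mem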
regards, tom lane
Thanks Tom,
I understand that the rationale behind choosing to create a new table from distinct records is that, since both approaches need full table scans, selecting distinct records is faster (and seems more straightforward) than finding/deleting duplicates.
Best regards,
Daniel
Hi,
on a large table you may get it done faster by using more than one thread, e.g.:
select a,b,c into newtable from oldtable where a%8 =0 group by a,b,c;
select a,b,c into newtable from oldtable where a%8 =1 group by a,b,c;
...
select a,b,c into newtable from oldtable where a%8 =7 group by a,b,c;
This will/should use a shared full table scan on oldtable.
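A hedged caveat: as written, each of those statements tries to create the same newtable, so only the first can succeed; a workable variant is to create an empty target once, then run plain inserts from separate sessions:

create table newtable as select a, b, c from oldtable limit 0;   -- empty shell, run once

-- then one statement per session, run concurrently:
insert into newtable select a, b, c from oldtable where a % 8 = 0 group by a, b, c;
insert into newtable select a, b, c from oldtable where a % 8 = 1 group by a, b, c;
-- ... up to a % 8 = 7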
HTH
Marc Mamin
Seems promising but could you provide me a reference to PostgreSQL
documentation regarding this "a%8=*" feature?
Best
Daniel
On Fri, Dec 12, 2014 at 9:57 AM, Daniel Begin <jfd553@hotmail.com> wrote:
Seems promising but could you provide me a reference to PostgreSQL
documentation regarding this "a%8=*" feature?
Best
% is the modulus operator.
Assuming "a" is an integer (I don't remember), doing 8 selects of "a modulus 8", one for each of the possible results (0..7), will each select about 1/8 of the entire table (I would guess), and put together they will end up selecting all of the original table. I don't know, myself, why this would be faster. But I'm not any kind of a PostgreSQL expert either.
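As a hedged aside, one can sanity-check how such a predicate partitions a table. Note that in PostgreSQL, % keeps the sign of its left operand, so negative keys would need abs():

select abs(a) % 8 as bucket, count(*)   -- expect roughly 1/8 of the rows per bucket
from oldtable
group by bucket
order by bucket;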
--
While a transcendent vocabulary is laudable, one must be eternally careful
so that the calculated objective of communication does not become ensconced
in obscurity. In other words, eschew obfuscation.
Maranatha! <><
John McKown
John McKown wrote
I don't
know, myself, why this would be faster. But I'm not any kind of a
PostgreSQL expert either.
It is faster because PostgreSQL does not have native parallelism. By using a%n in a where clause you can start n separate sessions, pick a different remainder for each one, and manually introduce parallelism into the activity.
Though given this is likely going to be I/O constrained, the possible gains do not scale linearly with the number of sessions - which themselves effectively max out at the number of cores available to the server.
David J.
Seems promising but could you provide me a reference to PostgreSQL
documentation regarding this "a%8=*" feature?
Best
% is the modulus operator.
Assuming "a" is an integer (I don't remember), then doing 8 selects of "a modulus 8" = for each of the possible results (0..7)? will each select about 1/8 of the entire table (I would guess) and the end result put together, they will end up selecting all of the original table. I don't know, myself, why this would be faster. But I'm not any kind of a PostgreSQL expert either.
yes.
Extracting unique values from very large sets is not free, neither from the I/O nor from the CPU point of view; spreading the tasks across more CPUs should help, since the threads can share a single scan.
(Does your huge table fit in RAM?)
If you don't have int values available for the % operator, you may also consider hashtext(value) % n, but the extra work may result in no benefit.
hashtext is not documented as it is not guaranteed to stay stable in future Postgres releases, but it is safe in such a case (http://lmgtfy.com/?q=postgres+hashtext+partition).
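A hedged sketch of that idea, writing each partition to its own target table ("value" standing for whatever column you partition on):

select a, b, c
into newtable_0
from oldtable
where abs(hashtext(value)) % 8 = 0   -- hashtext() can return negative ints, hence abs()
group by a, b, c;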
Another point is that I'm not sure all threads will join a shared scan on a freshly created table where the visibility hint bits are not yet set
(see the head comment in http://doxygen.postgresql.org/syncscan_8c_source.html),
because reading a table for the first time implies rewriting it:
http://www.cybertec.at/speeding-up-things-with-hint-bits/
You'll be able to avoid this extra I/O in upcoming 9.4 thanks to the new COPY FREEZE option:
http://www.postgresql.org/docs/9.4/interactive/sql-copy.html
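A hedged sketch of that option (the file path is hypothetical; FREEZE requires the table to have been created or truncated in the current subtransaction):

begin;
truncate oldtable;                                        -- satisfies the FREEZE precondition
copy oldtable from '/path/data.csv' with (format csv, freeze);
commit;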
Hmmm, on the other hand, I suppose that you will avoid the extra I/O for the hint bits if you first copy your data into a temp table, but then you won't be able to parallelize the job, as other transactions won't see the data.
Moreover, you need to pay attention to how much work_mem you can afford for each transaction, knowing you have x of them running concurrently.
So in the end I'm not sure if multiple threads will help here.
I'm using this approach in aggregations which are more CPU-intensive than a simple distinct.
I'm looking forward to seeing your test results :)
Marc Mamin
Thanks Marc (and all others),
I knew that nothing was free, and understanding where the costs come from provides me with some rationale to make my choice!
However, I understand from your answer that there is no definitive approach at this time (considering my specific context). Since the tables are quite similar (number of fields and data types), I will probably explore some of the promising approaches proposed so far (below) to get facts.
a) select distinct * into newtable from oldtable;
b) select a,b,c into newtable from oldtable group by a,b,c;
c) select distinct a,b,c into newtable from oldtable where a%6=* group by a,b,c;
d) select a,b,c into newtable from oldtable where a%6=* group by a,b,c;
c & d will be run after having set the hint bits with a select count (see the sketch below). Running on 8 cores, only 6 will be used (a%6=*).
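That hint-bit pass would presumably just be a throwaway full read of each source table, e.g.:

select count(*) from oldtable;   -- first full read sets hint bits, so later scans skip that write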
Anything else to add/replace in the scenarios above (a-d)?
Before going further, here is some context and some concerns you may wish to comment on:
- Most of the tables do not fit in my computer's 32GB memory, since they are usually between 35GB and 350GB;
- All the tables have a bigint ID I can use with the modulo operator (id%6=*);
- Working on a Windows platform;
- work_mem: 1024MB;
- maintenance_work_mem: 16384 (I understand the value must be significantly larger than work_mem);
- shared_buffers: 128MB (I understand that on Windows the useful range is 64MB to 512MB).
Awaiting comments and hoping to get back soon with useful results,
Daniel
On Dec 8, 2014, at 9:35 PM, Scott Marlowe wrote:
select a,b,c into newtable from oldtable group by a,b,c;
One pass, done.
This is a bit naive, but couldn't this approach potentially be faster (depending on the system)?
SELECT a, b, c
INTO duplicate_records
FROM (
    SELECT a, b, c, count(*) AS counted
    FROM source_table
    GROUP BY a, b, c
) q_inner
WHERE q_inner.counted > 1;

DELETE FROM source_table
USING duplicate_records
WHERE source_table.a = duplicate_records.a
  AND source_table.b = duplicate_records.b
  AND source_table.c = duplicate_records.c;
It would require multiple full table scans, but it would minimize the writing to disk -- and isn't a 'read' operation usually much more efficient than a 'write' operation? If the duplicate checking is only done on a small subset of columns, indexes could speed things up too.
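A hedged caveat: as written, the DELETE removes every copy of each duplicated row; presumably the final step is to restore exactly one copy from the side table, e.g.:

INSERT INTO source_table
SELECT a, b, c FROM duplicate_records;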