Bulk processing & deletion

Started by Ivan Voras over 14 years ago · 8 messages · general
#1 Ivan Voras
ivoras@freebsd.org

Hello,

I have a table with a large number of records (millions), on which the
following should be performed:

1. Retrieve a set of records by a SELECT query with a WHERE condition
2. Process these in the application
3. Delete them from the table

Now, in the default read-committed transaction isolation, I can't just
use the same WHERE condition with a DELETE in step 3 as it might delete
more records than are processed in step 1 (i.e. phantom read). I've
thought of several ways around it and would like some feedback on which
would be the most efficient:

#1: Create a giant DELETE WHERE ... IN (...) SQL command for step #3
with primary keys of records from step 1 - but will it hit a SQL string
length limitation in the database? Is there such a limit (and what is it?)

#2: Same as #1 but with batching the records to e.g. 1000 at a time, all
in one transaction

#3: Use a higher isolation level, probably Repeatable Read (PG 9.0) -
but then the question is will this block other clients from inserting
new data into the table? Also, is Repeatable Read enough?

Any other ideas?
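
Option #2 (batching the primary keys, all in one transaction) can be sketched as follows. This is a minimal, self-contained illustration using Python's stdlib sqlite3 rather than PostgreSQL; with PostgreSQL the same pattern works through a driver such as psycopg2, and the table and column names here are made up:

```python
# Sketch of "select, process, then delete by PK in batches".
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t (payload) VALUES (?)",
                 [("row %d" % i,) for i in range(2500)])
conn.commit()

# Step 1: select the rows to work on and remember their primary keys.
rows = conn.execute("SELECT id, payload FROM t WHERE id % 2 = 0").fetchall()
ids = [r[0] for r in rows]

# Step 2: process the rows in the application (omitted).

# Step 3: delete exactly the rows selected in step 1, a few hundred
# keys per DELETE statement, all inside one transaction.  (500 stays
# under the 999-parameter limit of older SQLite builds.)
BATCH = 500
with conn:  # one transaction; commits on success
    for i in range(0, len(ids), BATCH):
        chunk = ids[i:i + BATCH]
        placeholders = ",".join("?" * len(chunk))
        conn.execute("DELETE FROM t WHERE id IN (%s)" % placeholders, chunk)

remaining = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

Because the DELETE targets the remembered keys rather than re-running the WHERE condition, rows inserted concurrently between steps 1 and 3 are left alone.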

#2 Alban Hertroys
haramrae@gmail.com
In reply to: Ivan Voras (#1)
Re: Bulk processing & deletion

On 13 October 2011 14:20, Ivan Voras <ivoras@freebsd.org> wrote:

Hello,

I have a table with a large number of records (millions), on which the
following should be performed:

       1. Retrieve a set of records by a SELECT query with a WHERE condition
       2. Process these in the application
       3. Delete them from the table

Now, in the default read-committed transaction isolation, I can't just
use the same WHERE condition with a DELETE in step 3 as it might delete
more records than are processed in step 1 (i.e. phantom read). I've
thought of several ways around it and would like some feedback on which
would be the most efficient:

#1: Create a giant DELETE WHERE ... IN (...) SQL command for step #3
with primary keys of records from step 1 - but will it hit a SQL string
length limitation in the database? Is there such a limit (and what is it?)

#2: Same as #1 but with batching the records to e.g. 1000 at a time, all
in one transaction

#3: Use a higher isolation level, probably Repeatable Read (PG 9.0) -
but then the question is will this block other clients from inserting
new data into the table? Also, is Repeatable Read enough?

Any other ideas?

CREATE TABLE to_delete (
job_created timestamp NOT NULL DEFAULT now(),
fk_id int NOT NULL
);

-- Mark for deletion
INSERT INTO to_delete (fk_id) SELECT id FROM table WHERE condition = true;

-- Process in app
SELECT table.* FROM table INNER JOIN to_delete ON (table.id = to_delete.fk_id);

-- Delete them
DELETE FROM table WHERE id IN (SELECT fk_id FROM to_delete);

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

#3 Grzegorz Jaśkiewicz
gryzman@gmail.com
In reply to: Alban Hertroys (#2)
Re: Bulk processing & deletion

If you don't need the data for longer than a transaction (or the life
of the connection), use temporary tables to store the ids of the rows
you need to delete. If those ids change or move, it means you are
missing a PK on that table.
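
A temp-table variant of Alban's recipe might look like this (a sketch only; big_table, condition, and the column names are illustrative):

```sql
BEGIN;

-- Visible only to this session; ON COMMIT DROP ties its lifetime
-- to the transaction, so there is nothing to clean up afterwards.
CREATE TEMP TABLE to_delete (fk_id int NOT NULL) ON COMMIT DROP;

INSERT INTO to_delete (fk_id)
    SELECT id FROM big_table WHERE condition = true;

-- Process in the application:
SELECT b.* FROM big_table b JOIN to_delete d ON b.id = d.fk_id;

DELETE FROM big_table WHERE id IN (SELECT fk_id FROM to_delete);

COMMIT;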

#4 Ivan Voras
ivoras@freebsd.org
In reply to: Alban Hertroys (#2)
Re: Bulk processing & deletion

On 13/10/2011 14:34, Alban Hertroys wrote:

Any other ideas?

CREATE TABLE to_delete (
job_created timestamp NOT NULL DEFAULT now(),
fk_id int NOT NULL
);

-- Mark for deletion
INSERT INTO to_delete (fk_id) SELECT id FROM table WHERE condition = true;

-- Process in app
SELECT table.* FROM table INNER JOIN to_delete ON (table.id = to_delete.fk_id);

-- Delete them
DELETE FROM table WHERE id IN (SELECT fk_id FROM to_delete);

Good point. I can even use a temp table for this and make use of
UNLOGGED temp tables when we upgrade to 9.1!

#5 Alexander Pyhalov
In reply to: Ivan Voras (#1)
Re: Bulk processing & deletion

On 10/13/2011 16:20, Ivan Voras wrote:

Hello,

I have a table with a large number of records (millions), on which the
following should be performed:

1. Retrieve a set of records by a SELECT query with a WHERE condition
2. Process these in the application
3. Delete them from the table

Hello.
Maybe you can just do BEGIN; DELETE FROM ... WHERE ... RETURNING *;
process the records, and issue COMMIT after processing?
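
Spelled out, that suggestion looks roughly like this (a sketch; big_table and condition are illustrative names):

```sql
BEGIN;

-- One statement both marks the rows deleted and hands their
-- contents back to the application for processing.
DELETE FROM big_table
 WHERE condition = true
 RETURNING *;

-- ... process the returned rows in the application ...

COMMIT;   -- or ROLLBACK to put the rows back if processing fails
```

Nothing is visible to other transactions until the COMMIT, and rows inserted concurrently are untouched because the DELETE only ever sees its own snapshot.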

--
Best regards,
Alexander Pyhalov,
system administrator of Computer Center of Southern Federal University

#6 Steve Crawford
scrawford@pinpointresearch.com
In reply to: Ivan Voras (#1)
Re: Bulk processing & deletion

On 10/13/2011 05:20 AM, Ivan Voras wrote:

Hello,

I have a table with a large number of records (millions), on which the
following should be performed:

1. Retrieve a set of records by a SELECT query with a WHERE condition
2. Process these in the application
3. Delete them from the table

Without knowing a bit more, it is difficult to say. A couple questions:

1. Are there conflicting processes - i.e. could multiple applications be
in contention to process the same set of records?

2. Is the processing "all or none" or could individual records fail? If
so, how do you deal with reprocessing or returning those records to the
main table?

Depending on the nature of your app, it might be feasible to reorder the
actions to move the records to be processed into a "processing" table
and delete them from that table as the records are processed by the
application.

You could move the records into the processing table with:

with foo as (delete from main_table where your_where_clause returning
a,b,c) insert into processing_table (a,b,c) select a,b,c from foo;

In this case I would not recommend temporary or unlogged tables for the
processing table as that becomes the only source of the data once
deleted from the master table.

Cheers,
Steve

#7 Ivan Voras
ivoras@freebsd.org
In reply to: Steve Crawford (#6)
Re: Bulk processing & deletion

On 13 October 2011 20:08, Steve Crawford <scrawford@pinpointresearch.com> wrote:

On 10/13/2011 05:20 AM, Ivan Voras wrote:

Hello,

I have a table with a large number of records (millions), on which the
following should be performed:

       1. Retrieve a set of records by a SELECT query with a WHERE
condition
       2. Process these in the application
       3. Delete them from the table

Without knowing a bit more, it is difficult to say. A couple questions:

1. Are there conflicting processes - i.e. could multiple applications be in
contention to process the same set of records?

No, only one bulk processor.

2. Is the processing "all or none" or could individual records fail? If so,
how do you deal with reprocessing or returning those records to the main
table?

All or none; the nature of the thing is that there can be no fatal failures.

Depending on the nature of your app, it might be feasible to reorder the
actions to move the records to be processed into a "processing" table and
delete them from that table as the records are processed by the application.

You could move the records into the processing table with:

with foo as (delete from main_table where your_where_clause returning a,b,c)
insert into processing_table (a,b,c) select a,b,c from foo;

In this case I would not recommend temporary or unlogged tables for the
processing table as that becomes the only source of the data once deleted
from the master table.

Ok, thanks (to everyone)!

#8 pasman pasmański
pasman.p@gmail.com
In reply to: Ivan Voras (#4)
Re: Bulk processing & deletion

Unlogged tables can't be temporary.
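
That is, in 9.1's CREATE TABLE syntax the TEMPORARY and UNLOGGED keywords are mutually exclusive, and temp tables are not WAL-logged in the first place, so UNLOGGED only buys anything for ordinary tables. A sketch (table names illustrative):

```sql
CREATE UNLOGGED TABLE to_delete (fk_id int NOT NULL);  -- OK on 9.1+
CREATE TEMP TABLE to_delete2 (fk_id int NOT NULL);     -- OK; never WAL-logged anyway
-- CREATE UNLOGGED TEMP TABLE to_delete3 (...);        -- rejected: the two are exclusive
```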

2011/10/13, Ivan Voras <ivoras@freebsd.org>:

On 13/10/2011 14:34, Alban Hertroys wrote:

Any other ideas?

CREATE TABLE to_delete (
job_created timestamp NOT NULL DEFAULT now(),
fk_id int NOT NULL
);

-- Mark for deletion
INSERT INTO to_delete (fk_id) SELECT id FROM table WHERE condition = true;

-- Process in app
SELECT table.* FROM table INNER JOIN to_delete ON (table.id =
to_delete.fk_id);

-- Delete them
DELETE FROM table WHERE id IN (SELECT fk_id FROM to_delete);

Good point. I can even use a temp table for this and make use of
UNLOGGED temp tables when we upgrade to 9.1!

--
------------
pasman