efficiently migrating 'old' data from one table to another
I'm just wondering if there's a more efficient way of handling a certain periodic data migration.
We have a pair of tables with this structure:
table_a__live
    column_1 INT
    column_2 INT
    record_timestamp TIMESTAMP

table_a__archive
    column_1 INT
    column_2 INT
    record_timestamp TIMESTAMP
Periodically, we must migrate 'stale' items from `table_a__live` to `table_a__archive`. The entries are copied over to the archive, then deleted from the live table.
Staleness is calculated by age, so we need to use an INTERVAL. The "live" table can have anywhere from 100k to 20MM records.
The primary key on `table_a__live` is a composite of column_1 and column_2.
In order to minimize scanning the table, we opted to flag rows for migration with a dedicated column:
ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE is_migrate IS NOT NULL;
So our migration is then based on that `is_migrate` column:
BEGIN;
UPDATE table_a__live SET is_migrate = TRUE WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
INSERT INTO table_a__archive (column_1, column_2, record_timestamp) SELECT column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS TRUE;
DELETE FROM table_a__live WHERE is_migrate IS TRUE;
COMMIT;
The inserts and deletes are blazing fast, but the UPDATE is a bit slow, since Postgres rewrites every flagged row.
Can anyone suggest a better approach?
I considered copying everything to a temp table and then inserting/deleting based on that table, but there's a lot of disk I/O in that approach too.
FWIW, we're on Postgres 9.6.1.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
----- Original Message -----
From: "Jonathan Vanasco" <postgres@2xlp.com>
To: "pgsql-general general" <pgsql-general@postgresql.org>
Sent: Thursday, January 12, 2017 3:06:14 PM
Subject: [GENERAL] efficiently migrating 'old' data from one table to another

[quoted original message trimmed]
Review manual section 7.8.2. Data-Modifying Statements in WITH
https://www.postgresql.org/docs/9.6/static/queries-with.html
-- B
On 1/12/2017 12:06 PM, Jonathan Vanasco wrote:
I'm just wondering if there's a more efficient way of handling a certain periodic data migration.
Partition the tables by some date interval, such as week (if you do this archiving weekly). Each week, disconnect the oldest partition from the 'active' partition set and add it to the 'archive' partition set.
Voila: no inserts, deletes, or updates at all, just some metadata operations.
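On 9.6 (declarative partitioning only arrived in PostgreSQL 10), this rotation would be done with table inheritance. A minimal sketch, with an illustrative weekly partition name that is not from the thread:

```sql
-- A weekly child partition under the live parent (name and dates illustrative).
CREATE TABLE table_a__live_2017w02 (
    CHECK (record_timestamp >= '2017-01-09' AND record_timestamp < '2017-01-16')
) INHERITS (table_a__live);

-- Later, rotate the now-stale week into the archive: these are pure catalog
-- changes, no row movement or rewriting.
BEGIN;
ALTER TABLE table_a__live_2017w02 NO INHERIT table_a__live;
ALTER TABLE table_a__live_2017w02 INHERIT table_a__archive;
COMMIT;
```

Note that with inheritance partitioning, inserts must be routed to the correct child, either by inserting into the child table directly or via a trigger on the parent.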
--
john r pierce, recycling bits in santa cruz
On 01/12/2017 12:06 PM, Jonathan Vanasco wrote:
I'm just wondering if there's a more efficient way of handling a certain periodic data migration.
[quoted original message trimmed]
The inserts & deletes are blazing fast, but the UPDATE is a bit slow from postgres re-writing all the rows.
Maybe I am missing something, but why do the UPDATE?
Why not?:
BEGIN;
INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
SELECT column_1, column_2, record_timestamp
FROM table_a__live
WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
DELETE FROM table_a__live
WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
COMMIT;
With an index on record_timestamp.
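The supporting index could be built without blocking writes on a large live table; a sketch (the index name is illustrative, not from the thread):

```sql
-- CONCURRENTLY avoids taking a lock that blocks writes, at the cost of a
-- slower build. It cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_table_a__live_record_timestamp
    ON table_a__live (record_timestamp);
```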
--
Adrian Klaver
adrian.klaver@aklaver.com
On Thu, Jan 12, 2017 at 2:45 PM, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
[quoted message trimmed]
Maybe I am missing something, but why do the UPDATE?
Not to mention doubling the amount of I/O vacuum is going to chew up.
David J.
On Thu, Jan 12, 2017 at 2:19 PM, btober@computer.org
<btober@broadstripe.net> wrote:
[quoted messages trimmed]
Review manual section 7.8.2. Data-Modifying Statements in WITH
https://www.postgresql.org/docs/9.6/static/queries-with.html
this.
with data as (delete from foo where ... returning * ) insert into
foo_backup select * from data;
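Applied to the tables from the original post (an adaptation of Merlin's pattern, not his exact statement), that might look like:

```sql
-- Move stale rows in one statement: the DELETE's RETURNING clause feeds the
-- INSERT, so each row is scanned and locked exactly once, and the is_migrate
-- flag (and its extra row rewrite) is no longer needed.
WITH data AS (
    DELETE FROM table_a__live
    WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC'
                             - INTERVAL '1 month'
    RETURNING column_1, column_2, record_timestamp
)
INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
SELECT column_1, column_2, record_timestamp FROM data;
```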
On Jan 12, 2017, at 5:52 PM, Merlin Moncure wrote:
On Thu, Jan 12, 2017 at 2:19 PM, btober@computer.org <btober@broadstripe.net> wrote:
Review manual section 7.8.2. Data-Modifying Statements in WITH
https://www.postgresql.org/docs/9.6/static/queries-with.html
this.
with data as (delete from foo where ... returning * ) insert into
foo_backup select * from data;
Thanks, btober and merlin. That's exactly what I want.
On Jan 12, 2017, at 4:45 PM, Adrian Klaver wrote:
Maybe I am missing something, but why do the UPDATE?
Why not?:
...
With an index on record_timestamp.
That's actually the production deployment that we're trying to optimize. Depending on the size of the table (rows, width), it performs less than great, even with the index on record_timestamp.
The UPDATE actually worked faster in most situations. I honestly don't know why (the only thing that makes sense to me is server load), but the UPDATE + boolean test ended up being (much) faster than the timestamp comparison.
On Jan 12, 2017, Jonathan Vanasco <postgres@2xlp.com> wrote:
[quoted messages trimmed]
Thanks, btober and merlin. that's exactly what i want.
To help you a little more, I just did this for a set of tables within the last week. :) The heart of the program is this SQL:
my $Chunk_size = 10000;
my $Interval = 24;
my $sql = "
WITH
keys AS (
SELECT $pk_column
FROM $table
WHERE $time_column < NOW() - '$Interval MONTHS'::INTERVAL
ORDER BY $pk_column
LIMIT $Chunk_size ),
data AS (
DELETE FROM $table
WHERE $pk_column <= (SELECT MAX($pk_column) FROM keys)
RETURNING * )
INSERT INTO archive_$table SELECT * FROM data;";
That's from Perl, but I suspect you can guess what each var should be for your application. You can set $Chunk_size to whatever you want. There is obviously a loop around that which executes until we get 0 rows, then we move on to the next table.
The point of the chunks was to limit the impact on the production tables
as we move data out of them. If you don't have that concern and want to do all
rows at once then remove the LIMIT and ORDER BY.
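Since the live table in this thread has a composite primary key rather than a single ordered $pk_column, one possible adaptation (an assumption, not part of Kevin's code) is to key each chunk on ctid instead:

```sql
-- One chunk: collect up to 10k stale row ctids, delete exactly those rows,
-- and insert them into the archive. The client loops until 0 rows move.
WITH keys AS (
    SELECT ctid
    FROM table_a__live
    WHERE record_timestamp < now() - INTERVAL '24 months'
    LIMIT 10000
),
data AS (
    DELETE FROM table_a__live
    -- = ANY(ARRAY(...)) encourages a TID scan rather than a join on ctid.
    WHERE ctid = ANY (ARRAY(SELECT ctid FROM keys))
    RETURNING column_1, column_2, record_timestamp
)
INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
SELECT column_1, column_2, record_timestamp FROM data;
```

A caveat: ctids change when a row is updated, so treat this as a sketch that is safest when nothing else is rewriting the stale rows concurrently.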
HTH,
Kevin
On Fri, Jan 13, 2017 at 12:03 PM, <kbrannen@pwhome.com> wrote:
[quoted messages trimmed]
FYI, although it's likely OK in this instance, directly interpolating table names without precaution is considered dubious and should be avoided as a practice. SQL injection is a risk, and your code will fail in the presence of unusual but legal table names, such as those containing spaces.
For posterity: handling this kind of action inside the database (via plpgsql/EXECUTE), in order to leverage some internal routines, especially quote_ident(), is generally a good idea.
merlin