Efficiently move data from one table to another, with FK constraints?
I am using COPY to bulk load large volumes (i.e. multi GB range) of data to a staging table in a PostgreSQL 8.3. For performance, the staging table has no constraints, no primary key, etc. I want to move that data into the "real" tables, but need some advice on how to do that efficiently.
Here's a simple, abbreviated example of tables and relations I'm working with (in reality there are a lot more columns and foreign keys).
/* The raw bulk-loaded data. No indexes or constraints. */
CREATE TABLE log_entry (
req_time TIMESTAMP NOT NULL,
url TEXT NOT NULL,
bytes INTEGER NOT NULL
);
/* Where the data will be moved to. Will have indexes, etc */
CREATE TABLE request (
id BIGSERIAL PRIMARY KEY,
req_time TIMESTAMP WITH TIME ZONE NOT NULL,
bytes INTEGER NOT NULL,
fk_url INTEGER REFERENCES url NOT NULL,
);
CREATE TABLE url (
id SERIAL PRIMARY KEY,
path TEXT UNIQUE NOT NULL,
);
Is there a way to move this data in bulk efficiently? Specifically I'm wondering how to handle the foreign keys? The naive approach is:
1) For each column that is a foreign key in the target table, do INSERT ... SELECT DISTINCT ... to copy all the values into the appropriate child tables.
2) For each row in log_entry, do a similar insert to insert the data with the appropriate foreign keys.
3) delete the contents of table log_entry using TRUNCATE
Obviously, this would be very slow when handling tens of millions of records. Are there faster approaches to solving this problem?
Rob W wrote:
I am using COPY to bulk load large volumes (i.e. multi GB
range) of data to a staging table in a PostgreSQL 8.3. For
performance, the staging table has no constraints, no primary
key, etc. I want to move that data into the "real" tables,
but need some advice on how to do that efficiently.Here's a simple, abbreviated example of tables and relations
I'm working with (in reality there are a lot more columns and
foreign keys)./* The raw bulk-loaded data. No indexes or constraints. */
CREATE TABLE log_entry (
req_time TIMESTAMP NOT NULL,
url TEXT NOT NULL,
bytes INTEGER NOT NULL
);/* Where the data will be moved to. Will have indexes, etc */
CREATE TABLE request (
id BIGSERIAL PRIMARY KEY,
req_time TIMESTAMP WITH TIME ZONE NOT NULL,
bytes INTEGER NOT NULL,
fk_url INTEGER REFERENCES url NOT NULL,
);CREATE TABLE url (
id SERIAL PRIMARY KEY,
path TEXT UNIQUE NOT NULL,
);Is there a way to move this data in bulk efficiently?
Specifically I'm wondering how to handle the foreign keys?
The naive approach is:1) For each column that is a foreign key in the target table,
do INSERT ... SELECT DISTINCT ... to copy all the values
into the appropriate child tables.
2) For each row in log_entry, do a similar insert to insert
the data with the appropriate foreign keys.
3) delete the contents of table log_entry using TRUNCATEObviously, this would be very slow when handling tens of
millions of records. Are there faster approaches to solving
this problem?
How about something like that:
INSERT INTO url (path) (SELET DISTINCT url FROM log_entry);
Then
INSERT INTO request (req_time, bytes, fk_url)
(SELECT l.req_time, l.bytes, u.id FROM log_entry AS l JOIN url AS u ON (l.url = u.path));
I didn't test it, so there may be syntax errors and stuff.
But I doubt that it can be done much more efficiently.
Creating an index on log_entry(url) *might* improve performance. Check with EXPLAIN.
The TRUNCATE should not be a very expensive operation.
Yours,
Laurenz Albe