Modifying COPY TO
I am interested in hacking COPY TO such that one can specify that
rows are copied in a certain index order. I got as far as
src/backend/commands/copy.c:CopyTo(), and it looks like I would need
to modify the call to heap_beginscan() so that it uses a key. However,
I couldn't figure out how to provide one, or if I'm even looking at the
right area. Ideally, this behavior would be specified with a flag,
perhaps: "WITH INDEX <index_name>" or "WITH PRIMARY KEY"
or something similar.
The motivation for this change is as follows. I have a fairly large
database (10 million+ records) that mirrors the data in a proprietary
system. The only access to that data is through exported flat files.
Currently, those flat files are copied directly into a staging area in the
db via a COPY FROM, the actual tables are truncated, and the
staging data is inserted into the live tables. Since the data is read-only,
it doesn't matter that it is recreated every day. However, as you
can imagine, the import process takes quite a while (several hours).
Also, rebuilding the db from scratch every day loses any statistical
information gathered from the execution of queries during the day.
A possibility that I would like to pursue is to keep the staging data
from the previous day, do a COPY TO, import the new data into
another staging table with a COPY FROM, then export the fresh
data with another COPY TO. Then, I can write a fast C/C++
program to do a line-by-line comparison of each record, isolating
the ones that have changed from the previous day. I can then
emit those records in a change file that should be relatively small
and easy to update. Of course, this scheme can only work if
COPY TO emits the records in a reliable order.
Any assistance on this project would be greatly appreciated. The
best I can see, I'm stuck on line 1053 from copy.c:
scandesc = heap_beginscan(rel, mySnapshot, 0, NULL);
I suspect that I want it to look like this:
scandesc = heap_beginscan(rel, mySnapshot, 1, key);
where 'key' is an appropriately constructed ScanKey. It looks
like I want to call ScanKeyEntryInitialize(), but I'm not sure what
parameters I need to pass to it to get an index or the primary
key. I mostly need help building the ScanKey object. I think I
can figure out how to hack the custom option, etc. I should
mention that I am using the 7.4.7 codebase on Linux 2.4.
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
"Dave Held" <dave.held@arrayservicesgrp.com> writes:
I am interested in hacking COPY TO such that one can specify that
rows are copied in a certain index order. I got as far as=20
src/backend/commands/copy.c:CopyTo(), and it looks like I would need
to modify the call to heap_beginscan() so that it uses a key.
Actually you'd need to change it to be an index_beginscan call.
Offhand I don't think you need a scan key; the point of a scan key is to
filter the returned rows, and it doesn't sound like that's what you
want. So it should work to do
scandesc = index_beginscan(rel, idx, mySnapshot, 0, NULL);
(plus appropriate changes to the usage of the scandesc)
which leaves only the problem of opening the proper index.
regards, tom lane
* Dave Held (dave.held@arrayservicesgrp.com) wrote:
A possibility that I would like to pursue is to keep the staging data
from the previous day, do a COPY TO, import the new data into
another staging table with a COPY FROM, then export the fresh
data with another COPY TO. Then, I can write a fast C/C++
program to do a line-by-line comparison of each record, isolating
the ones that have changed from the previous day. I can then
emit those records in a change file that should be relatively small
and easy to update. Of course, this scheme can only work if
COPY TO emits the records in a reliable order.
sort -u old new > changes
?
Stephen
Instead of just being able to specify an index to use, how hard would it
be to allow COPY TO to work from a generic query instead of just a
table? That way you wouldn't need to do a diff, you could just export
only new records. ISTM there's far more use for copying from a query
result that just copying by index order.
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Jim C. Nasby wrote:
Instead of just being able to specify an index to use, how hard would it
be to allow COPY TO to work from a generic query instead of just a
table? That way you wouldn't need to do a diff, you could just export
only new records. ISTM there's far more use for copying from a query
result that just copying by index order.
The reason we don't support queries with COPY is because COPY reads from
the raw heap rather than going through the executor. We could
special-case it so queries go through the executor but would take a
little work.
The only trick I can think of for now is to use SELECT ... INTO
TEMPORARY tab ... oRDER BY and then use COPY to dump the table. It will
then dump in the order of the ORDER BY.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On 2005-02-25, "Dave Held" <dave.held@arrayservicesgrp.com> wrote:
A possibility that I would like to pursue is to keep the staging data
from the previous day, do a COPY TO, import the new data into
another staging table with a COPY FROM, then export the fresh
data with another COPY TO. Then, I can write a fast C/C++
program to do a line-by-line comparison of each record, isolating
the ones that have changed from the previous day. I can then
emit those records in a change file that should be relatively small
and easy to update.
I have an application that does something like this, but rather than use an
external program, I do the comparison in the database itself:
- import data from external system into a temporary table
- compare the temporary table against the live data (a full outer join
is a convenient way of doing this - I create an index on the temp table
first)
- perform insert/update/delete for each record that was added, changed
or removed
In my case the compare/update is in a pl/pgsql function. My data is only
2-3 million rows, a bit smaller than yours, but I have to update hourly,
not daily, and spend no more than 5-10 minutes on each update (currently
I can do it in 5: 2 to load the data, 3 to do the compare/update).
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
"Jim C. Nasby" <decibel@decibel.org> writes:
... ISTM there's far more use for copying from a query
result that just copying by index order.
Yeah. The other point is that it's entirely likely that an external
sort will be faster than using an indexscan to produce the sorted order.
If you instead create a command like
COPY FROM (SELECT whatever FROM foo ORDER BY something)
then you give the optimizer a chance at deciding what to do ... not that
I promise it will get it right, but a fixed choice is certain to be
wrong.
regards, tom lane