Alternative to drop index, load data, recreate index?
When loading very large data exports (> 1 million records) I have found
it necessary to use the following sequence to achieve even reasonable
import performance:
1. Drop all indices on the recipient table
2. Use "copy recipient_table from '/tmp/input.file';"
3. Recreate all indices on the recipient table
However, I now have tables so large that even the 'recreate all indices'
step is taking too long (15-20 minutes on 8.2.4).
I am considering moving to date-based partitioned tables (each table =
one month-year of data, for example). Before I go that far - are there
any other tricks I can or should be using to speed up my bulk data loading?
Thanks,
jason
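The three-step sequence above can be sketched in SQL; the index, column, and file names here are hypothetical stand-ins for whatever the real schema uses:

```sql
-- 1. Drop all indices on the recipient table
DROP INDEX recipient_table_created_at_idx;
DROP INDEX recipient_table_sender_idx;

-- 2. Bulk-load with COPY, which avoids per-row INSERT overhead
COPY recipient_table FROM '/tmp/input.file';

-- 3. Recreate the indices once the data is in place; building an
--    index in one pass is far cheaper than maintaining it row-by-row
CREATE INDEX recipient_table_created_at_idx ON recipient_table (created_at);
CREATE INDEX recipient_table_sender_idx ON recipient_table (sender_id);
```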
On Mon, 2007-09-10 at 17:06 -0700, Jason L. Buberel wrote:
When loading very large data exports (> 1 million records) I have
found it necessary to use the following sequence to achieve even
reasonable import performance:
1. Drop all indices on the recipient table
2. Use "copy recipient_table from '/tmp/input.file';"
3. Recreate all indices on the recipient table
However, I now have tables so large that even the 'recreate all
indices' step is taking too long (15-20 minutes on 8.2.4).
I am considering moving to date-based partitioned tables (each table =
one month-year of data, for example). Before I go that far - are there
any other tricks I can or should be using to speed up my bulk data
loading?
If you create the indexes with CONCURRENTLY, then you can write to the
tables while the indexes are being created. That might help reduce your
downtime window.
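The suggestion amounts to something like the following (the index and column names are hypothetical):

```sql
-- Build the index without holding a lock that blocks writes, so the
-- table stays available during the rebuild (available since 8.2).
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction
-- block, and the build takes longer than a plain CREATE INDEX.
CREATE INDEX CONCURRENTLY recipient_table_created_at_idx
    ON recipient_table (created_at);
```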
Regards,
Jeff Davis
On 9/10/07, Jason L. Buberel <jason@buberel.org> wrote:
When loading very large data exports (> 1 million records) I have found it
necessary to use the following sequence to achieve even reasonable import
performance:
1. Drop all indices on the recipient table
2. Use "copy recipient_table from '/tmp/input.file';"
3. Recreate all indices on the recipient table
However, I now have tables so large that even the 'recreate all indices'
step is taking too long (15-20 minutes on 8.2.4).
Well, that's pretty much the fastest way to import data.
If you've got foreign keys you can look into disabling them while
importing, if they're slowing things down.
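One way to take a foreign key out of the picture during the load is to drop the constraint and re-add it afterward; the constraint, column, and table names below are hypothetical:

```sql
-- Drop the FK so each loaded row skips the reference check
ALTER TABLE recipient_table DROP CONSTRAINT recipient_sender_fk;

COPY recipient_table FROM '/tmp/input.file';

-- Re-adding the constraint validates the whole table in one pass,
-- which is usually cheaper than checking row-by-row during the load
ALTER TABLE recipient_table
    ADD CONSTRAINT recipient_sender_fk
    FOREIGN KEY (sender_id) REFERENCES senders (id);
```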
I've got indexes that individually take 20 to 30 minutes to create on
one large reporting table. All 8 or so indexes together take well over 2
hours to create. But I don't load it very often.
Note that with 8.2 you can create your new indexes in a non-blocking
mode so that the table is accessible while the indexes are being
created.
I am considering moving to date-based partitioned tables (each table = one
month-year of data, for example). Before I go that far - are there any other
tricks I can or should be using to speed up my bulk data loading?
Partitioning would help. I've used it for tables where I've had to do
stuff like that, and it definitely helps.
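On 8.2 a date-based partitioning scheme would be built with table inheritance plus CHECK constraints, roughly as sketched below (table and column names are hypothetical):

```sql
-- One child table per month, constrained to that month's date range
CREATE TABLE recipient_table_2007_09 (
    CHECK (created_at >= DATE '2007-09-01'
       AND created_at <  DATE '2007-10-01')
) INHERITS (recipient_table);

-- Bulk-load straight into the child, then index only that partition,
-- so the rebuild cost is bounded by one month of data
COPY recipient_table_2007_09 FROM '/tmp/input.file';
CREATE INDEX recipient_2007_09_created_at_idx
    ON recipient_table_2007_09 (created_at);

-- Lets the planner skip partitions whose CHECK excludes the query range
SET constraint_exclusion = on;
```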
Jeff Davis <pgsql@j-davis.com> writes:
On Mon, 2007-09-10 at 17:06 -0700, Jason L. Buberel wrote:
However, I now have tables so large that even the 'recreate all
indices' step is taking too long (15-20 minutes on 8.2.4).
If you create the indexes with CONCURRENTLY, then you can write to the
tables while the indexes are being created. That might help reduce your
downtime window.
Also, I trust you've experimented to find the optimal
maintenance_work_mem for this task on your machine? The optimal value is
probably higher on 8.2.x than it was before, since we improved the sort code...
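Since maintenance_work_mem bounds the memory available to the CREATE INDEX sort, raising it just for the session doing the rebuild is a cheap experiment; the 1GB figure below is only an illustration, not a recommendation:

```sql
-- Applies to this session only; tune the value for the machine's RAM
SET maintenance_work_mem = '1GB';
CREATE INDEX recipient_table_created_at_idx
    ON recipient_table (created_at);
```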
regards, tom lane
On Mon, Sep 10, 2007 at 05:06:35PM -0700, Jason L. Buberel wrote:
I am considering moving to date-based partitioned tables (each table =
one month-year of data, for example). Before I go that far - is there
any other tricks I can or should be using to speed up my bulk data loading?
Did you try pg_bulkload? (http://pgbulkload.projects.postgresql.org/)
depesz
--
quicksil1er: "postgres is excellent, but like any DB it requires a
highly paid DBA. here's my CV!" :)
http://www.depesz.com/ - a blog for you (and my CV)
Depesz,
Thank you for the suggestion- I thought I had read up on that tool
earlier but had somehow managed to forget about it when starting this
phase of my investigation.
Needless to say, I can confirm the claims made on the project homepage
when using very large data sets.
- Loading 1.2M records into an indexed table:
  - pg_bulkload: 5m 29s
  - COPY FROM: 53m 20s
These results were obtained using pg-8.2.4 with pg_bulkload-2.2.0.
-jason
hubert depesz lubaczewski wrote:
On Mon, Sep 10, 2007 at 05:06:35PM -0700, Jason L. Buberel wrote:
I am considering moving to date-based partitioned tables (each table =
one month-year of data, for example). Before I go that far - are there
any other tricks I can or should be using to speed up my bulk data loading?
Did you try pg_bulkload? (http://pgbulkload.projects.postgresql.org/)
depesz