optimizing import of large CSV file into partitioned table?
After careful research, I would like to post the following problem I'm having
with the import of a large (16GB) CSV file. Here is a brief synopsis:
- this is running on Postgres (PG) version: PostgreSQL 8.3.9 on
i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Ubuntu 4.3.2-1ubuntu11)
4.3.2
- it is running on an Ubuntu (small) server instance at Amazon Web Services
(AWS), with a 320GB volume mounted for the PG data directory
- the database was created using the partition example in the documentation,
with an insert trigger and a function to direct which table each record
gets inserted into.
(see below for code on my table and trigger creation)
After some days of attempting to import the full 16GB CSV file, I decided to
split the thing up, using the split utility in Linux. This seemed to improve
things; once I had split the CSV file into files of about 10MB each, I finally
got my first successful import of about 257,000 records. However, this is going
to be a rather labor-intensive process to import the full 16GB file if I have
to manually split it up and import each smaller file separately.
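A minimal shell sketch of how the per-file imports could at least be scripted
rather than run by hand (the database name mydb is an assumption; the file
names follow the split output mentioned above):

# import every split file in sequence, one COPY transaction per file
for f in /data/allcalls-csv/sub*; do
    psql -U postgres -d mydb -c "\copy allcalls FROM '$f' WITH CSV"
done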
So, I am wondering if there is any way to optimize this process? I have been
using Postgres for several years, but have never had to partition or
optimize it for files of this size until now.
Any comments or suggestions would be most welcomed from this excellent
forum.
(I might add that I spent several weeks prior to this trying to get this to
work in MySQL, which I finally had to abandon.)
Sincerely,
Rick
Details of the code follow:
Here is the basic COPY command, which I run as the postgres user, to import
the CSV files:
<begin>
COPY allcalls FROM '/data/allcalls-csv/sub3ab' WITH CSV;
Here is what some sample data look like in the files:
3153371867,2008-02-04 16:11:00,1009,1,40
2125673062,2008-02-04 16:11:00,1009,1,41
5183562377,2008-02-04 16:11:00,1009,1,50
...
Here are the basic scripts that created the partition table and insert
trigger:
CREATE TABLE allcalls (
    phonenum    bigint,
    callstarted timestamp without time zone,
    status      int,
    attempts    int,
    duration    int
);
CREATE TABLE allcalls_0 (
    CHECK ( phonenum < 1000000000 )
) INHERITS (allcalls);
...(repeat this 9 more times, for 10 subpartition tables)
CREATE INDEX allcalls_0_phonenum ON allcalls_0 (phonenum);
...(repeat this 9 more times, for indexes on the 10 subpartition tables)
CREATE OR REPLACE FUNCTION allcalls_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.phonenum < 1000000000 ) THEN
        INSERT INTO allcalls_0 VALUES (NEW.*);
    ELSIF ( NEW.phonenum >= 1000000000 AND NEW.phonenum < 2000000000 ) THEN
        INSERT INTO allcalls_1 VALUES (NEW.*);
    ...(again, repeat the ELSIF branches for the rest of the partition tables)
    ELSE
        RAISE EXCEPTION 'phonenum out of range in allcalls_insert_trigger()';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;
CREATE TRIGGER insert_phonenum_trigger
BEFORE INSERT ON allcalls
FOR EACH ROW EXECUTE PROCEDURE allcalls_insert_trigger();
<end>
On 28 March 2010 18:33, Rick Casey <caseyrick@gmail.com> wrote:
[original message quoted in full; snipped]
The problem here is that you appear to require an index update, trigger
firing and constraint check for every single row. First thing I'd suggest
is to remove the indexes and apply them after your import; otherwise it'll
have to update the index for every single entry. And the trigger won't help
either: import into a single table and split it out into further tables
afterwards if required. And finally, the constraints should probably be
applied afterwards too, so cull any violating rows after importing.
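A rough SQL sketch of that workflow, reusing the names from the original post
(the staging table and the full-file path are assumptions, not part of the
original setup):

-- drop the per-partition indexes before the bulk load
DROP INDEX allcalls_0_phonenum;
-- ...repeat for the other nine partition indexes

-- load everything into one plain table with no trigger or constraints
-- (the source path here is hypothetical)
CREATE TABLE allcalls_staging (LIKE allcalls);
COPY allcalls_staging FROM '/data/allcalls-csv/allcalls.csv' WITH CSV;

-- then split out into the partitions in bulk
INSERT INTO allcalls_0 SELECT * FROM allcalls_staging WHERE phonenum < 1000000000;
-- ...repeat for the remaining ranges, then recreate the indexes
CREATE INDEX allcalls_0_phonenum ON allcalls_0 (phonenum);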
Thom
2010/3/28 Thom Brown <thombrown@gmail.com>:
The problem here is that you appear to require an index update, trigger
firing and constraint check for every single row. First thing I'd suggest
is to remove the indexes and apply them after your import; otherwise it'll
have to update the index for every single entry.
+1
And the trigger won't help
either: import into a single table and split it out into further tables
afterwards if required.
note: partitioning could help if there were multiple physical volumes
/ spindles for the data directory.
for maximizing performance, I would rather split the CSV input (with
awk/perl/whatever) before loading, to have one backend loading each
partition.
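A sketch of that idea, keying on the billions digit of phonenum to match the
ten partition ranges (file and database names are invented, and every range
is assumed to actually occur in the data):

# route each row to a per-partition file based on the billions digit
awk -F, '{ n = int($1 / 1000000000); print > ("part_" n ".csv") }' allcalls.csv

# one COPY backend per partition, loading the child tables directly
# (this bypasses the insert trigger entirely)
for n in 0 1 2 3 4 5 6 7 8 9; do
    psql -d mydb -c "\copy allcalls_$n FROM 'part_$n.csv' WITH CSV" &
done
wait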
And finally, the constraints should probably be
applied afterwards too, so cull any violating rows after importing.
+1
--
Filip Rembiałkowski
JID/mailto: filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/
hi,
i would recommend converting the input using some scripts into pg_dump format and using small temporary tables
without indexes to import into - and after all the data is in the db you could partition it...
you can also pre-partition your data using simple grep; this way you can import the data directly into the partitioned scheme
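A sketch of the grep variant, assuming ten-digit phone numbers for partitions
1 through 9 (input and output file names are invented):

# ten-digit numbers starting with digit d belong in the allcalls_d range
for d in 1 2 3 4 5 6 7 8 9; do
    grep "^$d[0-9]\{9\}," allcalls.csv > "allcalls_$d.csv"
done
# anything with nine or fewer digits falls in the allcalls_0 range
grep -E '^[0-9]{1,9},' allcalls.csv > allcalls_0.csv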
kirk
Thom Brown wrote:
[earlier messages quoted in full; snipped]
--
Nagy Zoltan (kirk) <kirk@bteam.hu>
Rick Casey <caseyrick@gmail.com> writes:
So, I am wondering if there is any way to optimize this process? I have been
using Postgres for several years, but have never had to partition or optimize
it for files of this size until now.
Any comments or suggestions would be most welcomed from this excellent forum.
The pgloader tool will import your data in batches of N lines; you get
to say how many lines you want to consider in each transaction. Plus,
you can have more than one Python thread importing your big file, either
sharing one writer and having the other threads do the parsing and
COPY, or having N independent threads each doing the reading/parsing/COPY.
http://pgloader.projects.postgresql.org/
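pgloader's own configuration isn't shown here; as a rough shell approximation
of the same idea, each N-line chunk can be loaded in its own transaction,
several at a time (chunk size, parallelism, and file names are illustrative):

# 100,000-line chunks, each loaded as its own transaction, four loaders at once
split -l 100000 allcalls.csv chunk_
ls chunk_* | xargs -P 4 -I {} \
    psql -d mydb -c "\copy allcalls FROM '{}' WITH CSV"

Per-chunk transactions also mean one bad chunk rolls back a few hundred
thousand rows instead of the whole 16GB load.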
Hope this helps,
--
dim
Thanks Dim; I was not aware of pgloader. This, and the other suggestions,
have helped a lot; thanks everyone.
--rick
On Mon, Mar 29, 2010 at 7:41 AM, Dimitri Fontaine <dfontaine@hi-media.com>wrote:
[Dimitri's reply quoted in full; snipped]
--
----------------------------------------------------------------------------
Rick Casey :: caseyrick@gmail.com :: 303.345.8893