slow speeds after 2 million rows inserted
Greetings,
I've got a Java application reading data from a flat file and inserting
it into a table. The first 2 million rows (each file contained about
1 million lines) went pretty fast: less than 40 minutes to insert into
the database.
After that, the inserts are slow. I think I could type the data in
faster than the Java application is loading the third file.
The table looks like this:

CREATE TABLE data_archive
(
    id serial NOT NULL,
    batchid integer NOT NULL,
    claimid character varying(25) NOT NULL,
    memberid character varying(45) NOT NULL,
    raw_data text NOT NULL,
    status integer DEFAULT 0,
    line_number integer,
    CONSTRAINT data_archive_pkey PRIMARY KEY (id)
);

There is also an index on batchid.
The insert command is like so:
"INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status,
line_number) VALUES ('" + commandBatchID + "', '', '', '" + raw_data +
"', '1', '" + myFilter.claimLine + "');";
where the raw_data variable is the line from the file.
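The concatenated string above can also be expressed as a parameterized
PreparedStatement, which lets the JDBC driver handle quoting of raw_data.
A minimal sketch; the ArchiveInserter class name and insertLine method are
illustrative, not from the original application:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ArchiveInserter {
    // Parameterized form of the INSERT above; the driver handles quoting,
    // so a quote character inside raw_data cannot break the statement.
    public static final String INSERT_SQL =
        "INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status, line_number) "
      + "VALUES (?, ?, ?, ?, ?, ?)";

    public static void insertLine(Connection c, int batchId,
                                  String rawData, int lineNumber) throws SQLException {
        try (PreparedStatement ps = c.prepareStatement(INSERT_SQL)) {
            ps.setInt(1, batchId);        // commandBatchID
            ps.setString(2, "");          // claimid
            ps.setString(3, "");          // memberid
            ps.setString(4, rawData);     // the line from the file
            ps.setInt(5, 1);              // status
            ps.setInt(6, lineNumber);     // myFilter.claimLine
            ps.executeUpdate();
        }
    }
}
```

The Connection is assumed to already exist; preparing the statement once
and reusing it across lines avoids re-parsing the SQL on every insert.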
How can I find out what is causing this slowdown, and how do I speed it up?
The database is 8.2.0 on x86_64-unknown-linux-gnu.
There is nothing else running on this database server (other than
standard Linux background processes); ps ax did not show anything else
running. There are no locks other than the occasional lock taken by the
INSERT query. I have done a VACUUM FULL on this table but not a REINDEX
(running now).
Thanks in advance,
James
On Fri, 2006-12-29 at 12:39 -0500, James Neff wrote:
> I've got a java application I am reading data from a flat file and
> inserting it into a table. The first 2 million rows (each file
> contained about 1 million lines) went pretty fast. Less than 40 mins to
> insert into the database.
You need to vacuum during the inserts :)
Joshua D. Drake
--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
James Neff wrote:
> I've got a java application I am reading data from a flat file and
> inserting it into a table. [...]
> How can I find out what is causing this slow down and how do I speed it up?
Consider using copy
> there is also an index on batchid.
> The insert command is like so:
> "INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status,
> line_number) VALUES ('" + commandBatchID + "', '', '', '" + raw_data +
> "', '1', '" + myFilter.claimLine + "');";
Also, as you are running 8.2, you can use multi-valued inserts:
INSERT INTO data_archive VALUES (...), (...), (...)
When do you commit these inserts? I have occasionally seen similar problems when doing heavy inserting/updating within one single transaction: at first everything runs fast, then after some time it slows down. If I commit every 1000 rows or so (large rows, small engine), this phenomenon does not occur. Maybe some buffer chokes if the transaction is too big.
In your case I'd recommend committing after every one or two million rows (if possible).
Regards, Frank.
On Fri, 29 Dec 2006 12:39:03 -0500 James Neff <jneff@tethyshealth.com> thought long, then sat down and wrote:
> I've got a java application I am reading data from a flat file and
> inserting it into a table. The first 2 million rows (each file
> contained about 1 million lines) went pretty fast.
> After that the insert speed is slow.
--
Frank Finner
Invenius - Lösungen mit Linux
Köpfchenstraße 36
57072 Siegen
Telefon: 0271 231 8606 Mail: frank.finner@invenius.de
Telefax: 0271 231 8608 Web: http://www.invenius.de
Key fingerprint = 90DF FF40 582E 6D6B BADF 6E6A A74E 67E4 E788 2651
Joshua D. Drake wrote:
> Also as you are running 8.2 you can use multi-valued inserts...
> INSERT INTO data_archive VALUES (...), (...), (...)
Would this speed things up? Or is that just another way to do it?
Thanks,
James
Joshua D. Drake wrote:
> You need to vacuum during the inserts :)
> Joshua D. Drake
I ran the vacuum during the INSERT and it seemed to help a little, but
it's still relatively slow compared to the first 2 million records.
Any other ideas?
Thanks,
James
On Fri, 2006-12-29 at 13:21 -0500, James Neff wrote:
> Joshua D. Drake wrote:
>> Also as you are running 8.2 you can use multi-valued inserts...
>> INSERT INTO data_archive VALUES (...), (...), (...)
> Would this speed things up? Or is that just another way to do it?
The fastest way will be COPY.
The second fastest will be multi-valued inserts in batches, e.g.:
INSERT INTO data_archive VALUES (...), (...), (...) (I don't know what the max is)
but commit every 1000 inserts or so.
Sincerely,
Joshua D. Drake
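A multi-valued INSERT still needs one placeholder group per row; one way to
generate the statement text for an n-row batch is sketched below (the
MultiValuedInsert helper name is illustrative):

```java
public class MultiValuedInsert {
    // Builds an n-row multi-valued INSERT for data_archive with JDBC
    // placeholders, e.g. for rows = 2:
    //   INSERT INTO data_archive (...) VALUES (?, ?, ?, ?, ?, ?), (?, ?, ?, ?, ?, ?)
    public static String build(int rows) {
        if (rows < 1) throw new IllegalArgumentException("need at least one row");
        StringBuilder sql = new StringBuilder(
            "INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status, line_number) VALUES ");
        for (int i = 0; i < rows; i++) {
            if (i > 0) sql.append(", ");
            sql.append("(?, ?, ?, ?, ?, ?)");
        }
        return sql.toString();
    }
}
```

The 6 * n parameters are then bound in order with setInt/setString on a
PreparedStatement before executeUpdate(), so one round trip carries a whole
batch of rows.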
Joshua D. Drake wrote:
> The fastest way will be copy.
> The second fastest will be multi-valued inserts in batches, but commit
> every 1000 inserts or so.
Another thing: if you have to do validations and so on, creating a temp
file with the validated information and using COPY will be faster. That
way you can validate and use COPY at the same time.
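COPY's default text format expects tab-separated fields with backslash
escapes, so validated lines have to be formatted accordingly before loading.
A sketch of that formatting half (the CopyLoader helper name is
illustrative); streaming the rows in then depends on your driver, e.g.
recent PostgreSQL JDBC drivers expose org.postgresql.copy.CopyManager for
COPY ... FROM STDIN:

```java
public class CopyLoader {
    // Escapes one field for COPY's default text format: backslash, tab,
    // newline and carriage return must be backslash-escaped.
    public static String escapeCopyField(String s) {
        return s.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    // Joins pre-validated fields into one COPY text-format row.
    public static String copyRow(String... fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) row.append('\t');
            row.append(escapeCopyField(fields[i]));
        }
        return row.append('\n').toString();
    }
}
```

Writing these rows to a temp file and running COPY data_archive (batchid,
claimid, memberid, raw_data, status, line_number) FROM 'file' (or FROM
STDIN) replaces millions of individual INSERT statements with one command.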
In Java, assuming you have a Connection c, you simply say "c.commit();" after doing some action on the database. After every commit, the transaction will be executed and closed and a new one opened, which runs until the next commit.
Regards, Frank.
On Fri, 29 Dec 2006 13:23:37 -0500 James Neff <jneff@tethyshealth.com> thought long, then sat down and wrote:
> I'm sorry to ask a stupid question, but how do I 'commit' the transactions?
> Thanks,
> James
> --
> James Neff
> Technology Specialist
> Tethys Health Ventures
> 4 North Park Drive, Suite 203
> Hunt Valley, MD 21030
> office: 410.771.0692 x103
> cell: 443.865.7874
Frank Finner wrote:
> In Java, assuming you have a Connection c, you simply say "c.commit();"
> after doing some action on the database. After every commit, the
> transaction will be executed and closed and a new one opened, which runs
> until the next commit.
> Regards, Frank.
That did it, thank you!
--James
> The fastest way will be copy.
> The second fastest will be multi-valued inserts in batches, e.g.
> INSERT INTO data_archive VALUES (...), (...), (...) (I don't know what
> the max is), but commit every 1000 inserts or so.
Is this some empirical value? Can someone give heuristics for how to calculate the optimal number of rows per transaction before committing? Or at least guidelines.
Nix.
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Frank Finner wrote:
> In Java, assuming you have a Connection c, you simply say
> "c.commit();" after doing some action on the database. After every
> commit, the transaction will be executed and closed and a new one
> opened, which runs until the next commit.
Assuming, of course, you started with c.setAutoCommit(false);
--
Guy Rouillier
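Putting the two answers together, the load loop with autocommit off and a
periodic commit might look like the sketch below (the BatchCommitter class
name and the batchSize value are illustrative, not tuned figures):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchCommitter {
    // Commit on every batchSize-th row.
    public static boolean shouldCommit(int rowsSoFar, int batchSize) {
        return rowsSoFar % batchSize == 0;
    }

    public static void load(Connection c, List<String> lines,
                            int batchId, int batchSize) throws SQLException {
        c.setAutoCommit(false);  // otherwise every INSERT is its own transaction
        try (PreparedStatement ps = c.prepareStatement(
                "INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status, line_number) "
              + "VALUES (?, ?, ?, ?, ?, ?)")) {
            int n = 0;
            for (String line : lines) {
                ps.setInt(1, batchId);
                ps.setString(2, "");
                ps.setString(3, "");
                ps.setString(4, line);
                ps.setInt(5, 1);
                ps.setInt(6, ++n);   // line number within the file
                ps.executeUpdate();
                if (shouldCommit(n, batchSize)) c.commit();
            }
            c.commit();              // flush the final partial batch
        }
    }
}
```

With autocommit left on, the driver wraps each executeUpdate() in its own
transaction, which is exactly the per-row overhead the thread is describing.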
> 1. There is no difference (speed-wise) between committing every 1K or every 250K rows.
It was really some time ago since I experimented with this; my last experiment was on PG 7.2 or 7.3. I was inserting circa 800,000 rows. Inserting without transactions took 25 hrs; inserting with 10,000 rows per transaction took about 2.5 hrs. So the speedup was 10x. I have not experimented with the transaction batch size, but I suspect that 1,000 would not show much speedup.
> 2. Vacuuming also makes no difference for a heavy insert-only table, only slows it down.
Makes sense. Since my application was dumping all records each month and inserting new ones, vacuum was really needed, but it gave no speedup.
> 3. Table size plays no real factor.
The reason I saw a speedup must have to do with the fact that without transactions, each insert was its own transaction. That was eating resources.
Nix.
> Inserting without transactions took 25 hrs. Inserting with 10,000 rows
> per transaction took about 2.5 hrs. So, the speedup was 10x.
Maybe this link may be useful; it contains additional links to various
PostgreSQL performance tests:
http://archives.postgresql.org/pgsql-general/2006-10/msg00662.php
Regards,
Richard Broersma Jr.
On 12/31/06, Nikola Milutinovic <alokin1@yahoo.com> wrote:
> 1. There is no difference (speed-wise) between committing every 1K or
> every 250K rows.
> It was really some time ago, since I have experimented with this. My last
> experiment was on PG 7.2 or 7.3. I was inserting circa 800,000 rows.
> Inserting without transactions took 25 hrs. Inserting with 10,000 rows per
> transaction took about 2.5 hrs. So, the speedup was 10x. I have not
> experimented with the transaction batch size, but I suspect that 1,000
> would not show much speedup.
I would imagine the overhead here is sending the BEGIN/COMMIT (or the
behavior could be version dependent), and the commit forcing a log
flush. According to the documentation, multiple inserts can potentially
be flushed in one shot when the database wakes up to do a log flush
automatically, so committing more frequently would actually slow you
down.
The time to commit depends on hardware; on my hardware it was around
40 ms (though you have to question the precision of that measurement:
it is probably less than 40 ms, and that is an upper limit under load).
My experiment was with 8.2, default configuration, so there is room for
improvement. It was installed from Devrim's RPM packages.
> 2. Vacuuming also makes no difference for a heavy insert-only table,
> only slows it down.
> Makes sense. Since my application was dumping all records each month and
> inserting new ones, vacuum was really needed, but no speedup.
I agree, vacuuming is clearly important. I would also think if you are
going to do a massive one-time update/delete as a maintenance item on a
normally read only table that you should plan on doing a vacuum full to
recover the space used by the "old" rows. Logically the fewer pages on
disk, the less I/O that will result in scenarios where you are doing
sequential scans and probably even many index scans.
It seems that the MVCC implementation would introduce fragmentation
(with respect to a btree-indexed field): if your table design had an
indexed creation-date field and you often range-scanned on that field
but also updated the records, then over time the optimizer would favor
the index less as the correlation approached 0. Obviously this is a
great feature for a "last update date" field. :) Not so great if your
primary queries are on a creation-date field.
> 3. Table size plays no real factor.
> The reason I saw speedup must have to do with the fact that without
> transactions, each insert was its own transaction. That was eating
> resources.
I would agree. I am also reusing the same statement handles (prepare
once, execute many) with DBD::Pg. The benefit here appears to be that it
prepares the cursor once (one pass to parse and generate the execution
plan) and executes the same plan multiple times. The difference was
about 2000 inserts/s!
This is one of the reasons why everyone keeps saying use COPY instead of
INSERT: COPY is essentially a one-time prepare with many executes.
Test #1 (prepare once, execute many):
4000 inserts 0.92 secs, 4368.84 inserts/s, commit 0.04 secs.
4000 inserts 0.93 secs, 4303.47 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4319.78 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4306.38 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4299.53 inserts/s, commit 0.02 secs.
4000 inserts 0.92 secs, 4345.44 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4298.67 inserts/s, commit 0.03 secs.
4000 inserts 0.91 secs, 4382.13 inserts/s, commit 0.04 secs.
4000 inserts 0.92 secs, 4347.44 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4314.66 inserts/s, commit 0.02 secs.
inserts/s (StdDev) = 28.7435
inserts/s (Mean) = 4328.6351
Test #2 (prepare/execute each time):
4000 inserts 1.92 secs, 2086.21 inserts/s, commit 0.04 secs.
4000 inserts 1.91 secs, 2092.67 inserts/s, commit 0.02 secs.
4000 inserts 1.91 secs, 2094.54 inserts/s, commit 0.05 secs.
4000 inserts 1.96 secs, 2042.55 inserts/s, commit 0.03 secs.
4000 inserts 1.91 secs, 2098.57 inserts/s, commit 0.04 secs.
4000 inserts 1.91 secs, 2098.91 inserts/s, commit 0.03 secs.
4000 inserts 1.91 secs, 2098.38 inserts/s, commit 0.03 secs.
4000 inserts 1.92 secs, 2083.94 inserts/s, commit 0.02 secs.
4000 inserts 1.95 secs, 2050.07 inserts/s, commit 0.03 secs.
4000 inserts 1.92 secs, 2086.14 inserts/s, commit 0.02 secs.
inserts/s (StdDev) = 19.2360
inserts/s (Mean) = 2083.1987