COPY v. java performance comparison

Started by Rob Sargent · about 12 years ago · 26 messages · general
#1Rob Sargent
robjsargent@gmail.com

I'm playing with various data models to compare performance and
practicalities and not sure if I should be surprised by the numbers I'm
getting. I hope this report isn't too wishy-washy for reasoned comment.

One model says a genotype is defined as follows:

              Table "public.oldstyle"
    +-------------+--------------+-----------+
    |   Column    |     Type     | Modifiers |
    +-------------+--------------+-----------+
    | id          | uuid         | not null  |
    | sample_name | text         | not null  |
    | marker_name | text         | not null  |
    | allele1     | character(1) |           |
    | allele2     | character(1) |           |
    +-------------+--------------+-----------+
    (0. id is a Primary Key)
    (1. Take what you will from the table name.)
    (2. I hadn't thought of "char" type at this point)
    (3. Ultimately the names would become ids, RI included)
    (4. We're loading 39 samples and ~950K markers)

I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+
hours (800+ records/sec). Then I tried COPY and killed that after 11.25
hours when I realised that I had added on non-unque index on the name
fields after the first load. By that point is was on line 28301887, so
~0.75 done which implies it would have take ~15hours to complete.

Would the overhead of the index likely explain this decrease in throughput?

Impatience got the better of me and I killed the second COPY. This time
it had done 54% of the file in 6.75 hours, extrapolating to roughly 12
hours to do the whole thing.

That matches up with the java speed. Not sure if I should be elated with
jOOQ or disappointed with COPY.
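For what it's worth, the extrapolations above can be sanity-checked with a bit of arithmetic (all figures taken from the estimates in this post):

```java
public class LoadRateCheck {
    // Hours to load `rows` rows at `rowsPerSec` rows per second
    static double hours(double rows, double rowsPerSec) {
        return rows / rowsPerSec / 3600.0;
    }

    public static void main(String[] args) {
        double totalRows = 37_000_000;

        // jOOQ batched inserts: 800+ rows/sec over the whole file
        System.out.printf("jOOQ:       ~%.1f h%n", hours(totalRows, 800));            // ~12.8 h

        // First COPY (stray index in place): 28,301,887 rows in 11.25 h
        System.out.printf("COPY + idx: ~%.1f h%n", 11.25 / (28_301_887 / totalRows)); // ~14.7 h

        // Second COPY: 54% of the file in 6.75 h
        System.out.printf("COPY:       ~%.1f h%n", 6.75 / 0.54);                      // ~12.5 h
    }
}
```

So the two COPY runs do extrapolate to roughly the same ballpark as the batched inserts.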

Btw, I can load roughly the same data into the model below in 10.5
seconds. It only adds 39 very wide lines. I haven't got to the
practicality bits yet :)

              Table "public.chipcall"
    +-------------+------+-----------+
    |   Column    | Type | Modifiers |
    +-------------+------+-----------+
    | id          | uuid |           |
    | sample_name | text |           |
    | chip_model  | uuid |           |
    | gta         | text |           |
    +-------------+------+-----------+

This is just a dev desktop environment:
RHEL 6.5
PostgreSQL 9.3.1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
4.4.7 20120313 (Red Hat 4.4.7-4 [hm, interesting contradiction]), 64-bit
2 quad-core CPUs at 2925.878 MHz
The input file and postgres data are on separate disks, but only one
controller.

Thanks in advance, even if you only read this far.

#2Steve Atkins
steve@blighty.com
In reply to: Rob Sargent (#1)
Re: COPY v. java performance comparison

On Apr 2, 2014, at 12:37 PM, Rob Sargent <robjsargent@gmail.com> wrote:

I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec). Then I tried COPY and killed that after 11.25 hours, when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~0.75 done, which implies it would have taken ~15 hours to complete.

Would the overhead of the index likely explain this decrease in throughput?

Impatience got the better of me and I killed the second COPY. This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 hours to do the whole thing.

That seems rather painfully slow. How exactly are you doing the bulk load? Are you CPU limited or disk limited?

Have you read http://www.postgresql.org/docs/current/interactive/populate.html ?

Cheers,
Steve

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Rob Sargent
robjsargent@gmail.com
In reply to: Steve Atkins (#2)
Re: COPY v. java performance comparison

On 04/02/2014 01:56 PM, Steve Atkins wrote:


That seems rather painfully slow. How exactly are you doing the bulk load? Are you CPU limited or disk limited?

Have you read http://www.postgresql.org/docs/current/interactive/populate.html ?

Cheers,
Steve

The copy command was pretty vanilla:

copy oldstyle from '/export/home/rob/share/testload/<file-redacted>'
with delimiter ' ';

I've been to that page, but (as I read them) none of the suggestions
sticks out as a sure thing. I'm not so worried about the actual
performance as I am about the relative throughput (sixes so far).

I'm not cpu bound, but I confess I didn't look at io stats during the
copy runs. I just assume it was pegged :)
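For reference, the checklist on that page boils down to something along these lines. This is only a sketch: the index and constraint names here are invented, and the WAL-skipping effect of TRUNCATE-then-COPY in one transaction only applies with wal_level = minimal.

```sql
BEGIN;
-- Drop the secondary index and pk before the load (names are hypothetical)
DROP INDEX IF EXISTS oldstyle_names_idx;
ALTER TABLE oldstyle DROP CONSTRAINT oldstyle_pkey;
TRUNCATE oldstyle;   -- truncating in the same transaction lets COPY bypass WAL
COPY oldstyle FROM '/path/to/datafile' WITH DELIMITER ' ';
-- Rebuild in bulk afterwards
ALTER TABLE oldstyle ADD PRIMARY KEY (id);
CREATE INDEX oldstyle_names_idx ON oldstyle (sample_name, marker_name);
COMMIT;
ANALYZE oldstyle;
```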

Thanks,

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Rob Sargent (#3)
Re: COPY v. java performance comparison

On 04/02/2014 01:14 PM, Rob Sargent wrote:

On 04/02/2014 01:56 PM, Steve Atkins wrote:


The copy command was pretty vanilla:

copy oldstyle from '/export/home/rob/share/testload/<file-redacted>'
with delimiter ' ';

I've been to that page, but (as I read them) none sticks out as a sure
thing. I'm not so worried about the actual performance as I am with the
relative throughput (sixes so far).

Have you looked at the Postgres logs from that time period to see if
there is anything of interest, say complaints about checkpoints?

I'm not cpu bound, but I confess I didn't look at io stats during the
copy runs. I just assume it was pegged :)

Thanks,

--
Adrian Klaver
adrian.klaver@aklaver.com


#5Bill Moran
wmoran@potentialtech.com
In reply to: Adrian Klaver (#4)
Re: COPY v. java performance comparison

Just for a comparison ... I wrote a Java program that copies data from
MySQL -> Postgres, using the COPY implementation in the JDBC driver. I've
occasionally seen 50,000+ rows/sec from this program, but the speed is
highly dependent on the table structure. Tables that are very wide tend
to run slower, for example.


--
Bill Moran <wmoran@potentialtech.com>


#6Steve Atkins
steve@blighty.com
In reply to: Rob Sargent (#3)
Re: COPY v. java performance comparison

On Apr 2, 2014, at 1:14 PM, Rob Sargent <robjsargent@gmail.com> wrote:


The copy command was pretty vanilla:
copy oldstyle from '/export/home/rob/share/testload/<file-redacted>' with delimiter ' ';
I've been to that page, but (as I read them) none sticks out as a sure thing. I'm not so worried about the actual performance as I am with the relative throughput (sixes so far).

I'm not cpu bound, but I confess I didn't look at io stats during the copy runs. I just assume it was pegged :)

If each row is, say, 100 bytes including the per-row overhead (plausible for a uuid and a couple of strings), and you're inserting 800 rows a second, that's 80k/second, which would be fairly pathetic.

On my laptop (which has an SSD, sure, but it's still a laptop) I can insert 40M rows of data that has a few integers and a few small strings in about 52 seconds. And that's just using a simple, single-threaded load using psql to run copy from stdin, reading from the same disk as the DB is on, with no tuning of any parameters to speed up the load.

12 hours suggests there's something fairly badly wrong with what you're doing. I'd definitely look at the server logs, check system load and double check what you're actually running.

(Running the same thing on a tiny VM, one that shares a single RAID5 of 7200rpm drives with about 40 other VMs, takes a shade under two minutes, mostly CPU bound).

Cheers,
Steve


#7Rob Sargent
robjsargent@gmail.com
In reply to: Adrian Klaver (#4)
Re: COPY v. java performance comparison

On 04/02/2014 02:36 PM, Adrian Klaver wrote:


Have you looked at the Postgres logs from that time period to see if
there is anything of interest, say complaining about checkpoints.


Well indeed there are copious LOG/HINT pairs along the lines of

LOG: checkpoints are occurring too frequently ([8 <= n <=29]
seconds apart)
HINT: Consider increasing the configuration parameter
"checkpoint_segments".

and these are during non-load periods. During the COPYs I see
uncountable numbers of

WARNING: pgstat wait timeout

As you guys have probably already concluded, I have some config/tuning
to do.
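A minimal sketch of the relevant 9.3-era settings; the values below are illustrative starting points, not tuned recommendations:

```
# postgresql.conf -- bulk-load-friendly sketch for 9.3 (values illustrative)
checkpoint_segments = 32            # default is 3; each segment is 16 MB of WAL
checkpoint_completion_target = 0.9  # spread checkpoint I/O out over the interval
maintenance_work_mem = 512MB        # faster index builds after the load
```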

Cheers,
rjs

#8Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Rob Sargent (#7)
Re: COPY v. java performance comparison

On 04/02/2014 02:04 PM, Rob Sargent wrote:


Well indeed there are copious LOG/HINT pairs along the lines of

LOG: checkpoints are occurring too frequently ([8 <= n <=29]
seconds apart)
HINT: Consider increasing the configuration parameter
"checkpoint_segments".

and these are during non-load periods. During the COPYs I see
uncountable numbers of

WARNING: pgstat wait timeout

As you guys have probably already concluded, I have some config/tuning
to do.

At a guess some of the problem comes from here:

"The input file and postgres data are on separate disks, but only one
controller."

Have you tried moving the input file to the same disk as the server, to
factor out the controller?


--
Adrian Klaver
adrian.klaver@aklaver.com


#9Rob Sargent
robjsargent@gmail.com
In reply to: Adrian Klaver (#8)
Re: COPY v. java performance comparison

On 04/02/2014 03:11 PM, Adrian Klaver wrote:


At a guess some of the problem comes from here:

"The input file and postgres data are on separate disks, but only one
controller."

Have you tried moving the input file to the same disk as the server,
to factor out the controller?

I labour under the delusion that it is through the controller one reads
and writes and that there might be some slight advantage to not doing
both against one drive if avoidable. Wrong again?


#10Jeff Janes
jeff.janes@gmail.com
In reply to: Rob Sargent (#1)
Re: COPY v. java performance comparison

On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsargent@gmail.com> wrote:


Would the overhead of the index likely explain this decrease in throughput?

Absolutely.

Impatience got the better of me and I killed the second COPY. This time
it had done 54% of the file in 6.75 hours, extrapolating to roughly 12
hours to do the whole thing.

Are you sure you actually dropped the indices? (And the primary key?)

I get about 375,000 lines per second with no indexes, triggers, constraints.

perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t", "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}' | time psql -c 'truncate oldstyle; copy oldstyle from stdin;'

(More if I truncate it in the same transaction as the copy)

If you can't drop the pk constraint, can you at least generate the values
in sort-order?

Cheers,

Jeff

#11Rob Sargent
robjsargent@gmail.com
In reply to: Jeff Janes (#10)
Re: COPY v. java performance comparison

On 04/02/2014 04:36 PM, Jeff Janes wrote:


Are you sure you actually dropped the indices? (And the primary key?)

I get about 375,000 lines per second with no indexes, triggers,
constraints.

perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t", "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}' | time psql -c 'truncate oldstyle; copy oldstyle from stdin;'

(More if I truncate it in the same transaction as the copy)

If you can't drop the pk constraint, can you at least generate the
values in sort-order?

Cheers,

Jeff

No, I'll leave the pk in at the very least. My example load (37M
records) will not be the last word by any means; that's one experiment,
if you will. My goal is not to see how fast I can get records in,
rather to see what I can expect going forward. Yes, I'm pretty sure I
dropped the index before the second kick at COPY.
I'm about to restart after some config changes (doubled
checkpoint_segments - I have no idea what the value should be).

Hope you'll stay tuned.

#12Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Rob Sargent (#9)
Re: COPY v. java performance comparison

On 04/02/2014 02:27 PM, Rob Sargent wrote:


Have you tried moving the input file to the same disk as the server,
to factor out the controller?

I labour under the delusion that it is through the controller one reads
and writes and that there might be some slight advantage to not doing
both against one drive if avoidable. Wrong again?

Well, there is one way to find out :)

You might try with something less than the whole file to come up with
an approximate rows/sec rate.

--
Adrian Klaver
adrian.klaver@aklaver.com


#13Rob Sargent
robjsargent@gmail.com
In reply to: Jeff Janes (#10)
Re: COPY v. java performance comparison


I restarted the java-based loading. Still with pk in place (and
actually generating the UUIDs etc.) I got ~1.1M rows in one minute, and
one "LOG: checkpoints are occurring too frequently (24 seconds apart)"
with checkpoint_segments now at 6 (v. default 3). Plotting the
records v. time, in that minute at least it's pretty linear. The time
per batch is relatively constant with a few hiccups (.5 sec per 10K
lines). So I've improved things immensely, but I'm not entirely sure
that the simple config change is the reason. If this continued I would
be done inside 40 minutes.

With COPY interruptus I now get 2.9M records per minute, so the load
would take only 12 or so minutes. I did get four reports of
too-frequent checkpoints: 2 at 15 seconds, 2 at 9 seconds.

I'll need to let each one go to completion.

If these numbers are at all accurate and realistic, I'm still impressed
with jOOQ, though COPY is rightfully back to its proper place as fastest
way to load.

#14Rob Sargent
robjsargent@gmail.com
In reply to: Adrian Klaver (#12)
Re: COPY v. java performance comparison

On 04/02/2014 06:06 PM, Adrian Klaver wrote:


Well things slow down over time, and lots of "too frequent"s:

Have done 500 batches in 24219 ms
Have done 1000 batches in 52362 ms
Have done 1500 batches in 82256 ms
Have done 2000 batches in 113754 ms
Have done 2500 batches in 149637 ms
Have done 3000 batches in 211314 ms
Have done 3500 batches in 301989 ms
Have done 4000 batches in 430817 ms
Have done 4500 batches in 596043 ms
Have done 5000 batches in 804250 ms

where a batch is 500,000 lines. This on the java side of course.
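Taking differences of those cumulative figures shows how sharply each successive chunk of 500 batches slows down (the numbers are copied from the log above):

```java
public class BatchSlowdown {
    // Per-chunk times (ms) computed from cumulative figures
    static long[] deltas(long[] cumulative) {
        long[] d = new long[cumulative.length];
        long prev = 0;
        for (int i = 0; i < cumulative.length; i++) {
            d[i] = cumulative[i] - prev;
            prev = cumulative[i];
        }
        return d;
    }

    public static void main(String[] args) {
        // Cumulative ms at each 500-batch mark, from the log above
        long[] cum = {24219, 52362, 82256, 113754, 149637,
                      211314, 301989, 430817, 596043, 804250};
        for (long d : deltas(cum)) {
            System.out.println(d + " ms");
        }
        // First chunk: 24219 ms; last chunk: 208207 ms -- roughly 8.6x slower
    }
}
```

That steady per-chunk growth is the kind of curve index maintenance or checkpoint pressure would produce, consistent with the log warnings.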

#15Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Rob Sargent (#14)
Re: COPY v. java performance comparison

On 04/02/2014 05:30 PM, Rob Sargent wrote:


This is drive to drive or on single drive?

--
Adrian Klaver
adrian.klaver@aklaver.com


#16Rob Sargent
robjsargent@gmail.com
In reply to: Adrian Klaver (#15)
Re: COPY v. java performance comparison

On 04/02/2014 08:40 PM, Adrian Klaver wrote:


This is drive to drive or on single drive?

Same as first go round, drive to drive.

#17Thomas Kellerer
spam_eater@gmx.net
In reply to: Rob Sargent (#1)
Re: COPY v. java performance comparison

Rob Sargent, 02.04.2014 21:37:

I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+
hours (800+ records/sec). Then I tried COPY and killed that after
11.25 hours when I realised that I had added a non-unique index on
the name fields after the first load. By that point it was on line
28301887, so ~0.75 done, which implies it would have taken ~15 hours
to complete.

Would the overhead of the index likely explain this decrease in
throughput?

Impatience got the better of me and I killed the second COPY. This
time it had done 54% of the file in 6.75 hours, extrapolating to
roughly 12 hours to do the whole thing.

That matches up with the java speed. Not sure if I should be elated
with jOOQ or disappointed with COPY.

This is not what I see with COPY FROM STDIN.

When I load 2 million rows using a batch size of 1000 with plain JDBC, that takes about 4 minutes.

Loading the same file through Java and COPY FROM STDIN takes about 4 seconds.

The table looks like this:

           Table "public.products"
      Column       |          Type          | Modifiers
-------------------+------------------------+-----------
 product_id        | integer                | not null
 ean_code          | bigint                 | not null
 product_name      | character varying(100) | not null
 manufacturer_name | character varying      | not null
 price             | numeric(10,2)          | not null
 publish_date      | date                   | not null
Indexes:
    "products_pkey" PRIMARY KEY, btree (product_id)
    "idx_publish_date" btree (publish_date, product_id)

During the load both indexes are present.

Regards
Thomas
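
Thomas's approach (COPY FROM STDIN driven from Java) can be sketched
with the PostgreSQL JDBC driver's CopyManager. The JDBC URL,
credentials, table, and file name below are placeholders, not details
from the thread:

```java
// Sketch: streaming a CSV file into PostgreSQL via COPY FROM STDIN
// using the JDBC driver's CopyManager (org.postgresql.copy).
import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopyLoader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/testdb", "user", "secret")) {
            CopyManager copy = ((PGConnection) conn).getCopyAPI();
            try (Reader reader = new FileReader("products.csv")) {
                // copyIn streams the whole file through one COPY command
                // and returns the number of rows written
                long rows = copy.copyIn(
                        "COPY products FROM STDIN WITH (FORMAT csv)", reader);
                System.out.println("loaded " + rows + " rows");
            }
        }
    }
}
```

Going by Thomas's numbers, this is roughly 500,000 rows/s (2M rows in
4 s) versus roughly 8,000 rows/s for the batched INSERTs, with both
indexes in place.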


#18Andy Colson
andy@squeakycode.net
In reply to: Rob Sargent (#14)
Re: COPY v. java performance comparison

On 4/2/2014 7:30 PM, Rob Sargent wrote:

Well things slow down over time, and lots of "too frequent"s:

Have done 500 batches in 24219 ms
Have done 1000 batches in 52362 ms
Have done 1500 batches in 82256 ms
Have done 2000 batches in 113754 ms
Have done 2500 batches in 149637 ms
Have done 3000 batches in 211314 ms
Have done 3500 batches in 301989 ms
Have done 4000 batches in 430817 ms
Have done 4500 batches in 596043 ms
Have done 5000 batches in 804250 ms

where a batch is 500,000 lines. This on the java side of course.

Have you watched cpu usage of the java process vs the PG process in top?
I had a perl program importing data that was the bottleneck because it
was calling rand(). I'll bet generating uuid's is cpu intensive too.

Is the java app cpu bound?

Also watch "vmstat 3" for a minute or two. The last two numbers (wa &
id) (some vmstat's have a steal, ignore that) will tell you if you are
io bound.

-Andy
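
Andy's question can be sharpened by differencing Rob's cumulative
timings: each reported line covers the same 500 batches, so the
per-interval cost is directly comparable, and it grows almost ninefold
from first interval to last. A small sketch:

```java
// Differencing the cumulative batch timings posted in the thread.
// Each interval covers the same number of batches, so the growth in
// interval time is a direct slowdown factor.
public class BatchSlowdown {
    public static void main(String[] args) {
        long[] cumulativeMs = {24219, 52362, 82256, 113754, 149637,
                               211314, 301989, 430817, 596043, 804250};
        long prev = 0;
        for (long t : cumulativeMs) {
            System.out.println("interval ms: " + (t - prev));
            prev = t;
        }
        // last interval (208207 ms) vs first (24219 ms): ~8.6x slower
        double slowdown = (804250 - 596043) / 24219.0;
        System.out.println("first-to-last slowdown: "
                + Math.round(slowdown * 10) / 10.0);
    }
}
```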


#19Rob Sargent
robjsargent@gmail.com
In reply to: Thomas Kellerer (#17)
Re: COPY v. java performance comparison


Thomas, thanks for these numbers.

I have to straighten out my environment, which I admit I was hoping to
avoid. I reset checkpoint_segments to 12 and restarted my server.
I kicked off the COPY at 19:00. That generated a couple of the "too
frequent" statements, plus 52 "WARNING: pgstat wait timeout" lines over
the next 8 hours, starting at 00:37 (5 hours in), until finally keeling
over at 03:04 on line 37363768. That's the last line of the input, so
obviously I didn't flush my last println properly. I'm beyond getting
embarrassed at this point.

Is turning auto-vacuum off a reasonable way through this?
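
For reference, the "too frequent" message is PostgreSQL's "checkpoints
are occurring too frequently" log hint, which points at
checkpoint_segments rather than autovacuum; 12 segments is still small
for a 37M-row COPY. A typical 9.x-era bulk-load configuration might
look like this (illustrative values only, not tuned for this box):

```ini
# postgresql.conf -- illustrative bulk-load settings for a 9.x-era server
checkpoint_segments = 64            # fewer forced checkpoints during COPY
checkpoint_completion_target = 0.9  # spread checkpoint I/O out
maintenance_work_mem = 512MB        # faster index builds after the load
synchronous_commit = off            # commonly acceptable for a repeatable import
```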

#20Rob Sargent
robjsargent@gmail.com
In reply to: Andy Colson (#18)
Re: COPY v. java performance comparison

Is the java app cpu bound?

Also watch "vmstat 3" for a minute or two. The last two numbers (wa
& id) (some vmstat's have a steal, ignore that) will tell you if you
are io bound.

-Andy

During COPY, with autovacuum off (server restarted, manual vacuum to
get things going). Immediately hit the "too frequently" gripe.
checkpoint_segments still at 12.

Not CPU bound, so waiting for IO, but surely one expects that?

vmstat 3
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b    swpd    free    buff    cache   si   so   bi    bo    in    cs us sy id wa st
 0  2  100148 3317248  763264 33731032    0    0    0 33355 10555 20122  1  1 84 14  0
 1  0  100148 3311040  763272 33736972    0    0    0 18128 11118 20601  5  1 87  6  0
 1  0  100148 3277808  763272 33769008    0    0    0 12536 11373 20551  7  2 90  1  0
 1  1  100148 3267020  763272 33778732    0    0    0 43125 10968 20848  5  2 90  3  0
 0  1  100148 3250528  763272 33792804    0    0    0 63567 10467 20121  2  1 81 16  0
 0  1  100148 3258584  763276 33784308    0    0    0 17121 10553 20146  1  1 86 12  0
 1  2  100148 3237616  763276 33802024    0    0    0 36327 11198 20151  4  2 85  9  0
 1  0  100148 3223224  763276 33816556    0    0    0 32189 10763 19900  2  1 84 13  0
 0  0  100148 3206740  763276 33831580    0    0    0 13233 11080 20469  7  3 89  1  0
 0  2  100148 3204872  763276 33829580    0    0    0 75205 10500 20912  2  1 84 13  0
 0  2  100148 3188256  763276 33847544    0    0    0 35448 11028 20788  3  2 86  9  0
 0  2  100148 3190248  763276 33844888    0    0    0 21808 11938 20848  2  1 82 15  0
 0  1  100148 3184656  763280 33848968    0    0    0 18592 11562 20574  1  1 84 14  0
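
Averaging the id and wa columns from the sample above (a quick check of
Andy's rule of thumb) shows the box is mostly idle with only moderate
I/O wait, which fits "not CPU bound" but also suggests the disk is not
saturated either:

```java
// Average the id (idle) and wa (I/O wait) columns of the vmstat sample.
// High wa would mean the load is io bound.
import java.util.Locale;

public class VmstatSummary {
    static double avg(int[] xs) {
        long sum = 0;
        for (int x : xs) sum += x;
        return sum / (double) xs.length;
    }

    public static void main(String[] args) {
        int[] id = {84, 87, 90, 90, 81, 86, 85, 84, 89, 84, 86, 82, 84};
        int[] wa = {14, 6, 1, 3, 16, 12, 9, 13, 1, 13, 9, 15, 14};
        // ~85% idle, ~10% I/O wait on average
        System.out.println(String.format(Locale.US,
                "avg id: %.1f%%  avg wa: %.1f%%", avg(id), avg(wa)));
    }
}
```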

#21Andy Colson
andy@squeakycode.net
In reply to: Rob Sargent (#20)
#22Andy Colson
andy@squeakycode.net
In reply to: Andy Colson (#21)
#23Jeff Janes
jeff.janes@gmail.com
In reply to: Rob Sargent (#11)
#24Jeff Janes
jeff.janes@gmail.com
In reply to: Rob Sargent (#13)
#25Jeff Janes
jeff.janes@gmail.com
In reply to: Rob Sargent (#19)
#26Rob Sargent
robjsargent@gmail.com
In reply to: Jeff Janes (#25)