performance of loading CSV data with COPY is 50 times faster than Perl::DBI

Started by Matthias Apitz about 6 years ago · 11 messages · general
#1 Matthias Apitz
guru@unixarea.de

Hello,

For ages we have transferred data between different DBMSs (Informix, Sybase,
Oracle, and now PostgreSQL) with a tool of our own, based on
Perl::DBI, which produces a CSV-like export in a common format, i.e. an
export from Oracle can be loaded into Sybase and vice versa. Export and
import are done row by row, for some tables millions of rows.

We produced a special version of the tool to export the rows into a
format that PostgreSQL's COPY command understands, and found
that the import of the same data into PostgreSQL with COPY is 50 times
faster than with Perl::DBI: 2.5 minutes vs. 140 minutes for around 6
million rows into an empty table without indexes.

How can COPY do this so fast?

matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

#2 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Matthias Apitz (#1)
Re: performance of loading CSV data with COPY is 50 times faster than Perl::DBI

On 1/31/20 10:24 AM, Matthias Apitz wrote:

> How can COPY do this so fast?

Well, for one thing COPY does everything in a single transaction, which
is both good and bad. The good is that it is fast; the bad is that a
single error will roll back the entire operation.

COPY also uses its own method for transferring data. For all the
details see:

https://www.postgresql.org/docs/12/protocol-flow.html#PROTOCOL-COPY
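
To make the contrast concrete, here is a sketch of the two shapes of load (table and file names are invented for illustration):

```sql
-- Row-by-row over DBI: one network round trip, one parse/plan,
-- and (with autocommit) one commit per row:
INSERT INTO books (id, title) VALUES (1, 'aaa');
INSERT INTO books (id, title) VALUES (2, 'bbb');
-- ... repeated millions of times

-- COPY: a single command in a single transaction; the rows are
-- streamed to the server in bulk over the COPY subprotocol:
COPY books (id, title) FROM '/tmp/books.csv' WITH (FORMAT csv);
```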


--
Adrian Klaver
adrian.klaver@aklaver.com

#3 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Matthias Apitz (#1)
Re: performance of loading CSV data with COPY is 50 times faster than Perl::DBI

On Fri, 31 Jan 2020 at 19:25, Matthias Apitz <guru@unixarea.de> wrote:

> How can COPY do this so fast?

There are probably several reasons:

1. The DBI implementation is probably not very efficient (for Postgres), partly
because COPY is not fault tolerant.
2. Postgres has no implicit plan cache, so every INSERT is planned again and
again.
3. COPY bypasses the planner and executor, and it has a very efficient network
protocol.
4. With COPY you can be sure that autocommit is disabled.

Regards

Pavel


#4 Brian Dunavant
dunavant@gmail.com
In reply to: Pavel Stehule (#3)
Re: performance of loading CSV data with COPY is 50 times faster than Perl::DBI

You can use COPY over DBI.

https://metacpan.org/pod/DBD::Pg#COPY-support
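
A minimal sketch of that interface (DSN, credentials, table and file names are hypothetical; pg_putcopydata/pg_putcopyend are the calls documented under DBD::Pg's COPY support):

```perl
use strict;
use warnings;
use DBI;

# Hypothetical connection; adjust DSN and credentials for your setup.
my $dbh = DBI->connect('dbi:Pg:dbname=testdb', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 1 });

# Put the connection into COPY mode:
$dbh->do('COPY books (id, title) FROM STDIN WITH (FORMAT csv)');

open my $fh, '<', 'books.csv' or die "books.csv: $!";
while ( my $line = <$fh> ) {
    $dbh->pg_putcopydata($line);   # stream each CSV line to the server
}
close $fh;

$dbh->pg_putcopyend();             # end the COPY; the load is one operation
```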


#5 Steven Lembark
lembark@wrkhors.com
In reply to: Matthias Apitz (#1)
Re: performance of loading CSV data with COPY is 50 times faster than Perl::DBI

On Fri, 31 Jan 2020 19:24:41 +0100
Matthias Apitz <guru@unixarea.de> wrote:

> How can COPY do this so fast?

DBI is a wonderful tool, but not intended for bulk transfer. It
is useful for post-processing queries that extract specific
data in ways that SQL cannot readily handle.

One big slowdown is that the pull-a-row, push-a-row cycle involves
significant latency on the database connections. That limits the
throughput.

Depending on who wrote the code, they may have extracted the rows
as hashrefs rather than arrays; that can be a 10x slowdown right
there. [I have no idea why so many people are so addicted to storing
rows in hashes, but it is always a significant slowdown; and
array slices are no more complicated than hash slices!]

Where DBI is really nice is in managing the copy: generating a
useful basename, determining the path, deciding whether to zip
the output, etc. Using metadata to decide which tables to back
up and where to put the results, all of the automation you'd
want for flexible backups, is easy with DBI. Bulk copies,
probably not.

--
Steven Lembark 3646 Flora Place
Workhorse Computing St. Louis, MO 63110
lembark@wrkhors.com +1 888 359 3508

#6 Steven Lembark
lembark@wrkhors.com
In reply to: Pavel Stehule (#3)
Re: performance of loading CSV data with COPY is 50 times faster than Perl::DBI

> 2. postgres has not implicit plan cache, so every INSERT planned
> again and again

Good point: if you are doing something in a loop, you probably
want to use prepare_cached() to get the $sth, for both the extract
and the insert.
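
A sketch of that pattern, with a hypothetical table and made-up rows:

```perl
use strict;
use warnings;
use DBI;

# Hypothetical connection; adjust DSN and credentials for your setup.
my $dbh = DBI->connect('dbi:Pg:dbname=testdb', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 0 });

my @rows = ( [ 1, 'aaa' ], [ 2, 'bbb' ] );   # stand-in for the real extract

# prepare_cached() prepares the statement once; later calls with the
# same SQL return the same handle instead of re-preparing in the loop.
my $sth = $dbh->prepare_cached(
    'INSERT INTO books (id, title) VALUES (?, ?)'
);

for my $row (@rows) {
    $sth->execute(@$row);
}
$dbh->commit;
```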

--
Steven Lembark 3646 Flora Place
Workhorse Computing St. Louis, MO 63110
lembark@wrkhors.com +1 888 359 3508

#7 Ravi Krishna
ravikrishna@vivaldi.net
In reply to: Steven Lembark (#5)
Re: performance of loading CSV data with COPY is 50 times faster than Perl::DBI

> Depending on who wrote the code, they may have extracted the rows
> as hashrefs rather than arrays; that can be a 10x slowdown right
> there. [I have no idea why so many people are so addicted to storing
> rows in hashes, but it is always a significant slowdown; and
> array slices are no more complicated than hash slices!]

I have not written Perl code for a while now, but most Perl coders,
already suffering from the stigma of coding in an "unreadable" language,
prefer not to make it worse by using arrays, which are position dependent;
reading such code can be a nightmare when a large number of columns is
selected.

Also, isn't an array ref even better than an array, since it avoids
copying the data into a local array in your code?
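
Both fetch styles side by side; column-index constants can recover much of the readability of hash keys without the per-row hash cost. A sketch, where $sth is assumed to be an executed DBI statement handle over a hypothetical books(id, title) query:

```perl
use strict;
use warnings;
use constant { ID => 0, TITLE => 1 };   # named indexes for readability

# Per-row hash construction -- convenient, but builds a new hash per row:
while ( my $row = $sth->fetchrow_hashref ) {
    print "$row->{id}: $row->{title}\n";
}

# fetchrow_arrayref returns (and reuses) one array ref per fetch --
# no per-row hash, and no copy into a local list:
while ( my $row = $sth->fetchrow_arrayref ) {
    print $row->[ID], ': ', $row->[TITLE], "\n";
}
```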

#8 Matthias Apitz
guru@unixarea.de
In reply to: Steven Lembark (#5)
Re: performance of loading CSV data with COPY is 50 times faster than Perl::DBI

On Monday, February 03, 2020 at 10:01:04 a.m. -0600, Steven Lembark wrote:

> One big slowdown is the cycle of pull-a-row, push-a-row involves
> significant latency due to database connections. That limits the
> throughput.

I should have mentioned this: the export is written to files on Linux,
and the import with that tool reads from those files.

matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

#9 Bret Stern
bret_stern@machinemanagement.com
In reply to: Ravi Krishna (#7)
Re: performance of loading CSV data with COPY is 50 times faster than Perl::DBI

"already suffering from a complex of coding in an unreadable language"

hearsay and conjecture


#10 Ravi Krishna
ravikrishna@vivaldi.net
In reply to: Bret Stern (#9)
Re: performance of loading CSV data with COPY is 50 times faster than Perl::DBI

> "already suffering from a complex of coding in an unreadable language"
>
> hearsay and conjecture

I was being somewhat facetious. But what finished Perl was the reputation it
earned as a write-only language. I don't think that reputation is without
merit.

IMO the early adopters of Perl were sysadmins and other non-programmers who
didn't do a stellar job of writing readable code. To make it worse, Perl folks
take it as a badge of honor to write their code in the most obfuscated manner
possible. They even get recognition for that.

#11 Ron
ronljohnsonjr@gmail.com
In reply to: Ravi Krishna (#10)
Re: performance of loading CSV data with COPY is 50 times faster than Perl::DBI

On 2/3/20 12:42 PM, Ravi Krishna wrote:

> I was somewhat facetious. But what finished perl was the reputation it
> earned that it is a write only language.

No need for an /Obfuscated Perl/ contest, because it's already obfuscated...

--
Angular momentum makes the world go 'round.