bulk inserts
Hi, I'm fairly new to postgres and am having trouble finding what I'm looking for. Is there a feature that allows bulk inserts into tables? My setup is Win XPe 2002 SP3 and PostgreSQL 8.3. I need to add entries from files that each contain 250-500 records. The files are created by a third party and read/written as binary. The table looks like the following:
CREATE TABLE data_log_20msec_table
(
log_id bigserial NOT NULL,
timestamp_dbl double precision,
data bytea,
CONSTRAINT data_log_20msec_table_pkey PRIMARY KEY (log_id)
)
WITH (OIDS=FALSE);
ALTER TABLE data_log_20msec_table OWNER TO postgres;
-- Index: data_log_20msec_table_timestamp_index
-- DROP INDEX data_log_20msec_table_timestamp_index;
CREATE INDEX data_log_20msec_table_timestamp_index
ON data_log_20msec_table
USING btree
(timestamp_dbl);
The current method for transferring records from the file to postgres is using a prepared statement that is called iteratively on each record read from the file:
INSERT INTO data_log_20msec_table (timestamp_dbl,data) VALUES ($1::double precision,$2::bytea)
Using COPY is out of the question as the file is not formatted for that and since other operations need to occur, the file needs to be read sequentially anyway.
Any amount of help would be gladly accepted, even if it's pointing me to another thread or somewhere in the manual. Thanks,
Dave Huber
________________________________
This electronic mail message is intended exclusively for the individual(s) or entity to which it is addressed. This message, together with any attachment, is confidential and may contain privileged information. Any unauthorized review, use, printing, retaining, copying, disclosure or distribution is strictly prohibited. If you have received this message in error, please immediately advise the sender by reply email message to the sender and delete all copies of this message.
THIS E-MAIL IS NOT AN OFFER OR ACCEPTANCE: Notwithstanding the Uniform Electronic Transactions Act or any other law of similar import, absent an express statement to the contrary contained in this e-mail, neither this e-mail nor any attachments are an offer or acceptance to enter into a contract, and are not intended to bind the sender, LeTourneau Technologies, Inc., or any of its subsidiaries, affiliates, or any other person or entity.
WARNING: Although the company has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments.
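For concreteness, the per-record approach Dave describes — read each fixed-size record from the binary file and run the prepared INSERT once per record — might look like the following Python sketch. The record layout, the 32-byte data size, and the psycopg2 calls in the trailing comment are assumptions for illustration only; the real third-party format is not given in the thread.

```python
import struct

# Hypothetical record layout (the real third-party format is not known):
# an 8-byte little-endian double timestamp followed by a fixed-size blob.
RECORD_DATA_SIZE = 32
RECORD_FMT = "<d%ds" % RECORD_DATA_SIZE
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def read_records(raw):
    """Yield (timestamp, data) tuples from the raw file contents."""
    for off in range(0, len(raw) - RECORD_SIZE + 1, RECORD_SIZE):
        yield struct.unpack_from(RECORD_FMT, raw, off)

# The per-record loop would then be roughly (psycopg2-style sketch):
#   for ts, data in read_records(open(path, "rb").read()):
#       cur.execute("INSERT INTO data_log_20msec_table (timestamp_dbl, data)"
#                   " VALUES (%s, %s)", (ts, psycopg2.Binary(data)))
```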
On Mon, Sep 28, 2009 at 10:38:05AM -0500, Dave Huber wrote:
Hi, I'm fairly new to postgres and am having trouble finding what I'm
looking for. Is there a feature that allows bulk inserts into tables?
My setup is Win XPe 2002 SP3 and PostgreSQL 8.3. I need to add
entries from a file where each file contains 250 - 500 records. The
files are created by a third-party and read/written as binary. The
table looks like the following:
<snip>
The current method for transferring records from the file to postgres is using a prepared statement that is called iteratively on each record read from the file:
INSERT INTO data_log_20msec_table (timestamp_dbl,data) VALUES ($1::double precision,$2::bytea)
Using COPY is out of the question as the file is not formatted for that and since other operations need to occur, the file needs to be read sequentially anyway.
The usual approach is to use COPY FROM STDIN and then feed the data with
PQputCopyData. That way you can perform any necessary munging and don't
require the file to be on disk at all.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
On Mon, Sep 28, 2009 at 10:38:05AM -0500, Dave Huber wrote:
Using COPY is out of the question as the file is not formatted for
that and since other operations need to occur, the file needs to be
read sequentially anyway.
Just to expand on what Martijn said; if you can generate a set of EXECUTE
commands, you can certainly generate a COPY command to insert the same
data. The advantage is a large drop in parse time for inserting larger
numbers of rows. As you're saying you want to insert 500 rows, I'd
suggest at least trying to get COPY working.
--
Sam http://samason.me.uk/
Thanks, Sam and Martijn. I am attempting to use the COPY command now. I had misunderstood what was meant by STDIN and assumed I could only use a file for my application and wasn't aware of PQputCopyData(). One assumption I am operating under right now is that the format of the binary file is the same as the buffer in PQputCopyData, including the header. If I am wrong, could someone please let me know? Thanks,
Dave
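On Dave's question: the buffer handed to PQputCopyData for a binary COPY must be in PostgreSQL's own COPY binary format (fixed signature, flags, per-tuple field counts, big-endian field lengths), so a third-party file's own header will not match. A hedged Python sketch of building such a buffer for the two explicitly listed columns (log_id is left to its bigserial default by naming the columns in the COPY command):

```python
import struct

PGCOPY_SIGNATURE = b"PGCOPY\n\xff\r\n\x00"  # 11-byte fixed header signature

def binary_copy_payload(rows):
    """Build a COPY ... BINARY-compatible buffer for rows of
    (timestamp double precision, data bytea).  All integers are
    big-endian, as the COPY binary format requires."""
    out = [PGCOPY_SIGNATURE, struct.pack(">II", 0, 0)]  # flags, extension len
    for ts, data in rows:
        out.append(struct.pack(">h", 2))          # field count for this tuple
        out.append(struct.pack(">id", 8, ts))     # float8: length then value
        out.append(struct.pack(">i", len(data)))  # bytea: length ...
        out.append(data)                          # ... then the raw bytes
    out.append(struct.pack(">h", -1))             # end-of-data trailer
    return b"".join(out)
```

The resulting bytes would then be sent in chunks with PQputCopyData after issuing something like COPY data_log_20msec_table (timestamp_dbl, data) FROM STDIN BINARY, finishing with PQputCopyEnd.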
On Mon, Sep 28, 2009 at 04:35:53PM -0500, Dave Huber wrote:
One assumption I am operating under right now is
that the format of the binary file is the same as the buffer in
PQputCopyData, including the header. If I am wrong, could someone
please let me know? Thanks,
I've always used ASCII representations of the data; no need to worry
about formats then. Not sure what sort of performance considerations
you have, but it's certainly less of a worry for me. I'm reasonably
sure the binary format changes more often than you'd like, maybe
something like:
would help. I can't see anything about COPY support, but it should help
with other things.
--
Sam http://samason.me.uk/
INSERTs/UPDATEs are historically slow, especially with autocommit on (or
implied autocommit): the database writer stops other processing and applies
that one record to the database.
Most bulk operations such as import/export and COPY are well worth using,
as they apply en masse before any commit. Remember, the DB actually stops,
flushes its buffers to disk, and then resumes. The only solution here is to
disable autocommit, but be wary: you may have hundreds of statements waiting
to be committed, and then someone quits your session and all your work is lost.
good call on copy
http://www.postgresql.org/docs/8.1/interactive/populate.html
cheers,
Martin Gainty
______________________________________________
Disclaimer and confidentiality note: This message is confidential. If you are not the intended recipient, please notify the sender. Any unauthorized forwarding or copying is prohibited. This message serves only to exchange information and has no legally binding effect. Because e-mail can easily be manipulated, we accept no liability for its content.
On Mon, Sep 28, 2009 at 08:33:45PM -0400, Martin Gainty wrote:
INSERTS/UPDATES are historically slow especially with autocommit is
on (implied autocommit on) the Database writer actually stops any
processing and applies that one record to the database
That seems to be overstating the issue somewhat. Each connection is
still independent and (assuming no locks are taken) will carry on as
such.
Most bulk operations such as import/export and copy are well worth
their weight as they apply en-masse before any commit ..
?? I'm not sure what you're implying about the semantics here, but it
doesn't seem right. COPY doesn't somehow break out of ACID semantics,
it's only an *optimization* that allows you to get large quantities of
data into the database faster. The main reason it's faster is because
parsing CSV data is easier than parsing SQL.
At least I think that's the only difference; anybody know better?
remember the DB actually stops flushes its buffers to Disk and then
resumes..
The DB as a whole does not stop if you issue a commit; just your
session/connection.
the only solution here is to disable autocommit but be wary
you may have 100's of statements waiting to be commited and then
someone does a quit on your session..all your work is lost
I'm not sure what you're saying here. These are normal transactional
semantics and are what all ACID databases are specified to do. You need
to issue a "COMMIT" for data to be committed.
--
Sam http://samason.me.uk/
All I have to say is wow! COPY works sooo much faster than the iterative method I was using. Even after having to read the entire binary file and reformat the data into the binary format that postgres needs, it is an order of magnitude faster than using a prepared INSERT. At least that's what my prelim testing is showing me. I will have to let it run for a while with a larger table to see how it performs in the long run, but initially this is very positive.
Thanks, guys!
Dave
________________________________
On Tue, Sep 29, 2009 at 3:31 PM, Dave Huber <DHuber@letourneautechnologies.com> wrote:
All I have to say is wow! COPY works sooo much faster than the iterative
method I was using. Even after having to read the entire binary file and
reformat the data into the binary format that postgres needs it is an order
of magnitude faster than using a prepared INSERT. At least that’s what my
prelim testing is showing me. I will have to let it run for a while with a
larger table to see how it performs in the longrun, but initially this is
very positive.
You can also try wrapping the whole bunch of INSERTs in a transaction block.
--
GJ
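The batching effect GJ describes is easy to demonstrate. As an illustration only, here is the same pattern using Python's stdlib sqlite3, chosen so the example runs without a database server; with PostgreSQL you would send BEGIN, the few hundred prepared INSERTs, then COMMIT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ts REAL, data BLOB)")

rows = [(i * 0.02, b"\x00" * 8) for i in range(500)]

# One transaction around the whole batch: a single commit (and flush)
# instead of one per row.  The "with conn" block commits once on exit.
with conn:
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

count = conn.execute("SELECT count(*) FROM t").fetchone()[0]
print(count)
```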
On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
?? I'm not sure what you're implying about the semantics here, but it
doesn't seem right. COPY doesn't somehow break out of ACID semantics,
it's only an *optimization* that allows you to get large quantities of
data into the database faster. The main reason it's faster is because
parsing CSV data is easier than parsing SQL. At least I think that's the only difference; anybody know better?
I think a big reason is also that the client can stream the data without
waiting for a network round trip ack on every statement.
--
"No animals were harmed in the recording of this episode. We tried but that
damn monkey was just too fast."
On Tue, Sep 29, 2009 at 08:45:55AM -0700, Alan Hodgson wrote:
On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
it's faster is because
parsing CSV data is easier than parsing SQL. At least I think that's the only difference; anybody know better?
I think a big reason is also that the client can stream the data without
waiting for a network round trip ack on every statement.
I don't think so. I'm pretty sure you can send multiple statements in a
single round trip. libpq is defined to work in such cases anyway:
http://www.postgresql.org/docs/current/static/libpq-exec.html
--
Sam http://samason.me.uk/
On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
I think a big reason is also that the client can stream the data
without waiting for a network round trip ack on every statement.
I don't think so. I'm pretty sure you can send multiple statements in a
single round trip. libpq is defined to work in such cases anyway:
http://www.postgresql.org/docs/current/static/libpq-exec.html
I'm sure you probably _can_, but how many programming loops do so?
--
"No animals were harmed in the recording of this episode. We tried but that
damn monkey was just too fast."
Sam Mason <sam@samason.me.uk> writes:
On Tue, Sep 29, 2009 at 08:45:55AM -0700, Alan Hodgson wrote:
I think a big reason is also that the client can stream the data without
waiting for a network round trip ack on every statement.
I don't think so. I'm pretty sure you can send multiple statements in a
single round trip.
You can, but that doesn't scale to megabytes of data (at least not
well).
I think the big points are elimination of per-row network and
transaction commit overhead ... but there are some other optimizations
in the COPY path too.
regards, tom lane
On Tue, Sep 29, 2009 at 09:11:19AM -0700, Alan Hodgson wrote:
On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
I'm pretty sure you can send multiple statements in a
single round trip. libpq is defined to work in such cases anyway:
http://www.postgresql.org/docs/current/static/libpq-exec.html
I'm sure you probably _can_, but how many programming loops do so?
It's not a very big sample, but I'm pretty sure I do! :)
--
Sam http://samason.me.uk/
On Tue, Sep 29, 2009 at 12:17:51PM -0400, Tom Lane wrote:
Sam Mason <sam@samason.me.uk> writes:
On Tue, Sep 29, 2009 at 08:45:55AM -0700, Alan Hodgson wrote:
I think a big reason is also that the client can stream the data without
waiting for a network round trip ack on every statement.
I don't think so. I'm pretty sure you can send multiple statements in a
single round trip.
You can, but that doesn't scale to megabytes of data (at least not
well).
No, but I didn't think that was being talked about. I was thinking
network round trip time does seem to become a thousand times less
important when you're putting a thousand statements together. This
would seem to imply that network latency can be almost arbitrarily
reduced.
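Sam's amortization point can be put in rough numbers (all figures made up for illustration): with per-round-trip latency L and per-row server cost c, n individual statements cost about n * (L + c), while batching them into one round trip costs about L + n * c.

```python
L, c, n = 0.5e-3, 10e-6, 1000   # 0.5 ms RTT, 10 us/row, 1000 rows (made up)
per_statement = n * (L + c)     # a network round trip for every statement
batched = L + n * c             # one round trip for the whole batch
speedup = per_statement / batched
print(speedup)                  # latency stops dominating once batched
```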
I think the big points are elimination of per-row network and
transaction commit overhead ...
Well, if you start including transaction commit then you've just changed
semantics away from COPY. I was implicitly thinking of what changes
when you keep the same semantics as COPY.
but there are some other optimizations
in the COPY path too.
Cool, I'll continue to prefer COPY then!
--
Sam http://samason.me.uk/
On 2009-09-29, Alan Hodgson <ahodgson@simkin.ca> wrote:
On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
?? I'm not sure what you're implying about the semantics here, but it
doesn't seem right. COPY doesn't somehow break out of ACID semantics,
it's only an *optimization* that allows you to get large quantities of
data into the database faster. The main reason it's faster is because
parsing CSV data is easier than parsing SQL. At least I think that's the only difference; anybody know better?
I think a big reason is also that the client can stream the data without
waiting for a network round trip ack on every statement.
A single INSERT statement can insert many rows.
The win with COPY is more that the data part can be parsed one record
at a time, whereas for the INSERT the whole statement must be parsed; also
the quoting rules are simpler for CSV or pg's tab-delimited format.
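Jasen's first point — one INSERT carrying many rows, supported since PostgreSQL 8.2 — can be sketched as building a single multi-row VALUES list. The table and column names come from the thread; the %s placeholders and the cursor call in the comment are psycopg2-style assumptions.

```python
def multirow_insert(table, cols, nrows):
    """Build one INSERT statement covering nrows rows of placeholders."""
    row = "(" + ", ".join(["%s"] * len(cols)) + ")"
    return "INSERT INTO %s (%s) VALUES %s" % (
        table, ", ".join(cols), ", ".join([row] * nrows))

sql = multirow_insert("data_log_20msec_table",
                      ["timestamp_dbl", "data"], 3)
# cur.execute(sql, flat_list_of_six_params)   # hypothetical psycopg2 cursor
```

This parses once per batch rather than once per row, which is the same parse-time saving Sam described for COPY, though COPY remains the faster option for large batches.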