Feature proposal

Started by wstrzalkaover 15 years ago17 messagesgeneral
Jump to latest
#1wstrzalka
wstrzalka@gmail.com

I'm currently playing with very large data import using COPY from
file.

As this can be extremely long operation (hours in my case) the nice
feature would be some option to show operation progress - how many
rows were already imported.

Or maybe there is some way to do it? As long as postgres have no read-
uncommited I think I can estimate it only by destination table size ??

#2Wappler, Robert
rwappler@ophardt.com
In reply to: wstrzalka (#1)
Re: Feature proposal

On 2010-08-25, wstrzalka wrote:

I'm currently playing with very large data import using COPY from

file.

As this can be extremely long operation (hours in my case) the nice
feature would be some option to show operation progress - how many
rows were already imported.

Or maybe there is some way to do it? As long as postgres have no read-
uncommited I think I can estimate it only by destination table size ??

-- Sent via pgsql-general mailing list (pgsql-general@postgresql.org)

To

make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Hi,

You can use tools like pv for a progress bar and pipe the output into
psql.

HTH
--
Robert...

#3Denis BUCHER
dbucherml@hsolutions.ch
In reply to: wstrzalka (#1)
Re: Feature proposal

Le 25.08.2010 09:15, wstrzalka a �crit :

I'm currently playing with very large data import using COPY from
file.

As this can be extremely long operation (hours in my case) the nice
feature would be some option to show operation progress - how many
rows were already imported.

Or maybe there is some way to do it? As long as postgres have no read-
uncommited I think I can estimate it only by destination table size ??

By the way, did you try to optimize your postgresql server ?

In my case I was able to reduce a big data update from :
1 hour 15 minutes
to :
5 minutes

Without even changing any line of data or code in sql !

Incredible, isn't it ?

Denis

#4Eric Comeau
ecomeau@signiant.com
In reply to: Denis BUCHER (#3)
Re: Feature proposal

On Wed, 2010-08-25 at 17:06 +0200, Denis BUCHER wrote:

Le 25.08.2010 09:15, wstrzalka a crit :

I'm currently playing with very large data import using COPY from
file.

As this can be extremely long operation (hours in my case) the nice
feature would be some option to show operation progress - how many
rows were already imported.

Or maybe there is some way to do it? As long as postgres have no read-
uncommited I think I can estimate it only by destination table size ??

By the way, did you try to optimize your postgresql server ?

In my case I was able to reduce a big data update from :
1 hour 15 minutes
to :
5 minutes

Without even changing any line of data or code in sql !

Incredible, isn't it ?

Curious- what postgresql.conf settings did you change to improve it?

#5Joshua D. Drake
jd@commandprompt.com
In reply to: Eric Comeau (#4)
Re: Feature proposal

On Wed, 2010-08-25 at 12:20 -0400, Eric Comeau wrote:

Without even changing any line of data or code in sql !

Incredible, isn't it ?

Curious- what postgresql.conf settings did you change to improve it?

The most obvious would be to turn fsync off, sychronous_commit off,
increase work_mem, increase checkpoint_timeout, increase wal_segments.

JD

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt

#6Steve Clark
sclark@netwolves.com
In reply to: Joshua D. Drake (#5)
Re: Feature proposal

On 08/25/2010 12:30 PM, Joshua D. Drake wrote:

On Wed, 2010-08-25 at 12:20 -0400, Eric Comeau wrote:

Without even changing any line of data or code in sql !

Incredible, isn't it ?

Curious- what postgresql.conf settings did you change to improve it?

The most obvious would be to turn fsync off, sychronous_commit off,
increase work_mem, increase checkpoint_timeout, increase wal_segments.

JD

can these be changed on the fly via set commands or does the config file have to be
changed and postgres stopped and restarted.

postgres 8.3.7 on freebsd.

--
Stephen Clark
NetWolves
Sr. Software Engineer III
Phone: 813-579-3200
Fax: 813-882-0209
Email: steve.clark@netwolves.com
www.netwolves.com

#7wstrzalka
wstrzalka@gmail.com
In reply to: Joshua D. Drake (#5)
Re: Feature proposal

Yea - I'll try to optimize as I had a plan to write to
pgsql.performance for rescue anyway.

I don't know exact hardware specification yet - known facts at the
moment are:
Sun Turgo?? (SPARC) with 32 cores
17GB RAM (1GB for shared buffers)
hdd - ?
OS - Solaris 10 - the system is running in the zone (Solaris
virtualization) - however during test nothing else is utilizing the
machine.
PostgreSQL 8.4.4 64bit

The data set is 9mln rows - about 250 columns
The result database size is ~9GB
Load time ~2h 20min
CPU utilization - 1,2% (half of the one core)
iostat shows writes ~6MB/s, 20% busy
when I run 2 loads in parallel the CPU is split to 2*0,6%, hdd write
~7MB (almost the same)

postgresql.conf changes:
checkpoint_segments - 128
checkpoint_timeout - 30min
shared_buffers - 1GB
maintenance_work_mem - 128MB

does it looks like my HDD is the problem? or maybe the Solaris
virtualization?

what's also interesting - table is empty when I start (by truncate)
but while the COPY is working, I see it grows (by \d+ or
pg_total_relation_size) about 1MB per second
what I'd expect it should grow at checkpoints only, not all the
time - am I wrong?

On Wed, 2010-08-25 at 12:20 -0400, Eric Comeau wrote:

Without even changing any line of data or code in sql !

Incredible, isn't it ?

Curious- what postgresql.conf settings did you change to improve it?

The most obvious would be to turn fsync off, sychronous_commit off,
increase work_mem, increase checkpoint_timeout, increase wal_segments.

JD

--
Pozdrowienia,
Wojciech Strzałka

#8John R Pierce
pierce@hogranch.com
In reply to: wstrzalka (#7)
Re: Feature proposal

On 08/25/10 11:47 AM, Wojciech Strzałka wrote:

The data set is 9mln rows - about 250 columns

Having 250 columns in a single table sets off the 'normalization' alarm
in my head.

#9Craig Ringer
craig@2ndquadrant.com
In reply to: Steve Clark (#6)
Re: Feature proposal

On 26/08/2010 1:06 AM, Steve Clark wrote:

On 08/25/2010 12:30 PM, Joshua D. Drake wrote:

On Wed, 2010-08-25 at 12:20 -0400, Eric Comeau wrote:

Without even changing any line of data or code in sql !

Incredible, isn't it ?

Curious- what postgresql.conf settings did you change to improve it?

The most obvious would be to turn fsync off, sychronous_commit off,
increase work_mem, increase checkpoint_timeout, increase wal_segments.

JD

can these be changed on the fly via set commands or does the config file
have to be changed and postgres stopped and restarted.

First: Many options can be changed by editing the config file then
telling the postmaster to reload its configuration, rather that
restarting the postmaster. See pg_ctl.

As for the specific options:

Checkpoint and WAL tuning is necessary and important in any real
postgresql instance under load, and it's quite safe to adjust the
checkpoint timeouts and wal segment counts to suit your needs. You'll
need a restart to change the number of wal segments; I'm not so sure
about the checkpoint timeout.

You can't change fsync without a config file edit and a restart. You
should **NEVER** be using fsync=off with data you cannot afford to lose,
so it's a good thing in a way. You might use it to help initially load a
database with bulk data, but fsync should be turned back on and the
database restarted before you start actually using it. fsync=off is
**NOT SAFE**.

synchronous_commit also has effects on data safety. It permits the loss
of transactions committed within the commit delay interval if the server
crashes. If you turn it on, you need to decide how much recent work you
can afford to lose if the database crashes. Not sure if it can be
applied with a reload or whether it requires a full server restart.

So: if you don't know exactly what you're doing, leave fsync alone.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

#10wstrzalka
wstrzalka@gmail.com
In reply to: wstrzalka (#1)
Re: Feature proposal

On 26 Aug, 01:28, pie...@hogranch.com (John R Pierce) wrote:

  On 08/25/10 11:47 AM, Wojciech Strzałka wrote:

  The data set is 9mln rows - about 250 columns

Having 250 columns in a single table sets off the 'normalization' alarm
in my head.

--
Sent via pgsql-general mailing list (pgsql-gene...@postgresql.org)
To make changes to your subscription:http://www.postgresql.org/mailpref/pgsql-general

Yeap - but it is as it is.
I need to migrate PG first - then start thinking about schema changes

#11wstrzalka
wstrzalka@gmail.com
In reply to: wstrzalka (#1)
Re: Feature proposal

On 26 Sie, 08:06, wstrzalka <wstrza...@gmail.com> wrote:

On 26 Aug, 01:28, pie...@hogranch.com (John R Pierce) wrote:

  On 08/25/10 11:47 AM, Wojciech Strzałka wrote:

  The data set is 9mln rows - about 250 columns

Having 250 columns in a single table sets off the 'normalization' alarm
in my head.

--
Sent via pgsql-general mailing list (pgsql-gene...@postgresql.org)
To make changes to your subscription:http://www.postgresql.org/mailpref/pgsql-general

Yeap - but it is as it is.
I need to migrate PG first - then start thinking about schema changes

So after turning off fsync & synchronous_commit (which I can afford as
I'm populating database from scratch)
I've stucked at 43 minutes for the mentioned table. There is no PK,
constrains, indexes, ... - nothing except for data.

The behaviour changed - I'm utilizing the core 100%, iostat shows the
write peaks about 70MB/s, the table shown by \d+ is growing all the
time as it growth before.
Is there anything I can look at?
Anyway the load to PG is much faster then dump from the old database
and the current load time is acceptable for me.

#12Sam Mason
sam@samason.me.uk
In reply to: wstrzalka (#7)
Re: Feature proposal

On Wed, Aug 25, 2010 at 08:47:10PM +0200, Wojciech Strzaaaka wrote:

The data set is 9mln rows - about 250 columns

250 columns sounds very strange to me as well! I start to getting
worried when I hit a tenth of that.

CPU utilization - 1,2% (half of the one core)
iostat shows writes ~6MB/s, 20% busy
when I run 2 loads in parallel the CPU is split to 2*0,6%, hdd write
~7MB (almost the same)

If you've got indexes set up on the table then I'd expect this sort
of behavior, you could try dropping them before the copy and then
recreating them afterward.

It would be great if PG could do these sorts of bulk index updates
automatically! Maybe run the first few tens/hundred changes in the
main index and then start logging the rows that will need indexing and
bulk process and merge them at the end. Concurrent access seems a bit
more complicated, but shouldn't be too bad. The case of a UNIQUE index
seems to require a change in behavior. For example, the following are
executed concurrently:

Client A: COPY foo (id) FROM stdin;
Client B: INSERT INTO foo (id) VALUES (1);

with A starting before and finishing after B, and A sends a row with
id=1.

At the moment the behavior would be for A's data to be indexed
immediately and hence B's conflicting change would fail. If PG did
bulk index merging at the end, this would change to B's succeeding and
A's failing when the index was brought up to date. These semantics are
still compatible with SQL, just different from before so some code may
be (incorrectly) relying on this.

I've read discussions from:
http://archives.postgresql.org/pgsql-hackers/2008-02/msg00811.php
and
http://archives.postgresql.org/pgsql-general/2008-01/msg01048.php

but not found much recent. It seems to hold together better than
the first suggestion. Second post notes that you may be better off
working in work_mem batches to help preventing spilling to disk. Sounds
reasonable, and if it's OK to assume the new rows will be physically
close to each other then they can be recorded as ranges/run length
encoded to reduce the chance of spilling to disk for even very large
inserts. As per the second post, I'm struggling with BEFORE INSERT
triggers as well, their semantics seem to preclude most optimizations.

what's also interesting - table is empty when I start (by truncate)
but while the COPY is working, I see it grows (by \d+ or
pg_total_relation_size) about 1MB per second
what I'd expect it should grow at checkpoints only, not all the
time - am I wrong?

AFAIU, it'll constantly grow.

--
Sam http://samason.me.uk/

#13Vick Khera
vivek@khera.org
In reply to: Craig Ringer (#9)
Re: Feature proposal

On Wed, Aug 25, 2010 at 8:48 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:

synchronous_commit also has effects on data safety. It permits the loss of
transactions committed within the commit delay interval if the server
crashes. If you turn it on, you need to decide how much recent work you can
afford to lose if the database crashes. Not sure if it can be applied with a
reload or whether it requires a full server restart.

I routinely set synchronous_commit = off on a per-connection or
per-transaction basis. The beauty of it is that it still honors
transaction boundaries. That is, if there is a server crash the
transaction will be either there or not as a whole; it will not be
partially applied. This works great for bulk imports and changes to
the DB for me, since I can always just re-run my programs on such
failure and everything will pick up where it left off. It takes some
planning but is worth it.

So: if you don't know exactly what you're doing, leave fsync alone.

I agree -- leave fsync alone. You get benefit from synchronous_commit
without the corruption risk.

The other advice on boosting checkpoint segments and timeout are spot
on. Make them pretty big and it will make your import go way faster.
If you have a spare disk on which to move the checkpoint segments so
that you eliminate the seek time on them, move them to get even more
speed. After your import, you can make the number of segments smaller
again if that suits your workload.

#14Adrian von Bidder
avbidder@fortytwo.ch
In reply to: wstrzalka (#1)
Re: Feature proposal

Heyho!

On Wednesday 25 August 2010 09.15:33 wstrzalka wrote:

I'm currently playing with very large data import using COPY from
file.

As this can be extremely long operation (hours in my case) the nice
feature would be some option to show operation progress - how many
rows were already imported.

Recently, I've found (on Linux, don't know if other OSs export this
information) /proc/<pid>/fdinfo/<fd> extremely helpful. It tells you the
position of the file pointer of file number <fd> in process <pid> (I guess
for a COPY import this would be the postgresql backend handling your import
session.)

Unlike other options, you can also use this if you only notice that the
process is long-running after you've already started it.

Of course it probably will not work if the file is mmapped or otherwise not
read in a sequential fashion.

cheers
-- vb

--
All Hail Discordia!

#15Peter Eisentraut
peter_e@gmx.net
In reply to: wstrzalka (#1)
Re: Feature proposal

On ons, 2010-08-25 at 00:15 -0700, wstrzalka wrote:

I'm currently playing with very large data import using COPY from
file.

As this can be extremely long operation (hours in my case) the nice
feature would be some option to show operation progress - how many
rows were already imported.

A feature like this is being worked on:
https://commitfest.postgresql.org/action/patch_view?id=368

#16Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: wstrzalka (#11)
Re: Feature proposal

Excerpts from wstrzalka's message of jue ago 26 03:18:36 -0400 2010:

So after turning off fsync & synchronous_commit (which I can afford as
I'm populating database from scratch)
I've stucked at 43 minutes for the mentioned table. There is no PK,
constrains, indexes, ... - nothing except for data.

Are you truncating the table in the same transaction that copies the
data into it? If you do that, an optimization to skip WAL fires getting
you a nice performance boost. You need to have WAL archiving turned off
though.

Also, if you do that, perhaps there's no point in turning off fsync and
synch_commit because an fsync will be done only once when the copy is
complete.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#17wstrzalka
wstrzalka@gmail.com
In reply to: Alvaro Herrera (#16)
Re: Feature proposal

No I don't, but definitely will try tomorrow

Excerpts from wstrzalka's message of jue ago 26 03:18:36 -0400 2010:

So after turning off fsync & synchronous_commit (which I can afford as
I'm populating database from scratch)
I've stucked at 43 minutes for the mentioned table. There is no PK,
constrains, indexes, ... - nothing except for data.

Are you truncating the table in the same transaction that copies the
data into it? If you do that, an optimization to skip WAL fires getting
you a nice performance boost. You need to have WAL archiving turned off
though.

Also, if you do that, perhaps there's no point in turning off fsync and
synch_commit because an fsync will be done only once when the copy is
complete.

--
Pozdrowienia,
Wojciech Strzałka