Parallel pg_dump for 9.1
People have been talking about a parallel version of pg_dump a few
times already. I have been working on some proof-of-concept code for
this feature every now and then and I am planning to contribute this
for 9.1.
There are two main issues with a parallel version of pg_dump:
The first one is that it requires a consistent snapshot among multiple
pg_dump clients and the second is that currently the output goes to a
single file and it is unclear what to do about multiple processes
writing into a single file.
- There are ideas on how to solve the issue with the consistent
snapshot but in the end you can always solve it by stopping your
application(s). I actually assume that whenever people are interested
in a very fast dump, it is because they are doing some maintenance
task (like migrating to a different server) that involves pg_dump. In
these cases, they would stop their system anyway.
Even if we had consistent snapshots in a future version, would we
forbid people to run parallel dumps against old server versions? What
I suggest is to just display a big warning if run against a server
without consistent snapshot support (which currently is every
version).
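To make the snapshot problem concrete, here is a minimal illustration
(database and table names are made up) of why independent dump
connections, which is what a parallel pg_dump would open today, do not
see one consistent state:

    # Each pg_dump run takes its own snapshot, so a write that commits
    # in between is visible to the second dump but not to the first:
    pg_dump -t orders      mydb > orders.sql        # snapshot at T1
    psql -c "INSERT INTO order_items VALUES (42, 1, 1)" mydb  # commits at T2
    pg_dump -t order_items mydb > order_items.sql   # snapshot at T3
    # order_items.sql may now reference orders that orders.sql does not
    # contain; parallel workers have the same problem unless they share
    # one snapshot or the applications are stopped.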
- Regarding the output of pg_dump I am proposing two solutions. The
first one is to introduce a new archive type "directory" where each
table and each blob is a file in a directory, similar to the
experimental "files" archive type. The idea has also come up that you
should be able to specify multiple directories in order to make use of
several physical disk drives. Taking this further, in order to manage
the mess that you can create with this, every file of the same backup
needs to carry a unique identifier, and pg_restore should have a check
option that tells you whether your backup directory is in a sane and
complete state (think of moving a file from one backup directory to
another, or trying to restore from two directories that belong to
different backup sets...).
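Just to illustrate, such a directory archive could look something like
this on disk (the file names and layout are made up, nothing here is a
settled design):

    backup_20100329/
        toc.dat          # table of contents: object definitions, dependencies,
                         # and the unique identifier of this backup set
        3241.dat         # COPY data for one table, named after its dump id
        3242.dat         # COPY data for another table
        blob_16403.dat   # one file per large object

A check option in pg_restore could then verify that every file listed
in toc.dat is present and carries the same backup identifier.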
The second solution to the single-file problem is to generate no
output at all, i.e. whatever you export from your source database you
import directly into your target database, which in the end turns out
to be a parallel form of "pg_dump | psql".
In fact, technically this is more a parallel pg_restore than a
parallel pg_dump, as you need to respect the dependencies between
objects. The good news is that with the parallel pg_restore of the
custom archive format, we already have everything in place for this
dependency checking. The addition is a new archive type that dumps,
just in time, whatever the dependency algorithm decides to restore
next.
This is probably the fastest way that we can copy or upgrade a
database when pg_migrator cannot be used (for example when you migrate
to a different hardware architecture).
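To put it differently, the serial form of this already exists today,
and the proposal is essentially a parallel version of it (the option
spellings in the second command are invented purely for illustration):

    # what people do today, serially:
    pg_dump -h oldhost sourcedb | psql -h newhost targetdb

    # what the proposed mode would amount to conceptually: N dump
    # connections feeding N restore connections, with the restore-side
    # dependency ordering deciding what gets dumped next
    # (--jobs/--target are hypothetical spellings):
    pg_dump --jobs=8 --target="host=newhost dbname=targetdb" sourcedb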
As mentioned, I have some working code for the features described
(Unix only). If anybody would like to give it a try now, just let me
know; I'd be happy to get some early test reports, and you could check
what speedup to expect. But before I continue, I'd like to have a
discussion about what people actually want and what the best way
forward is.
I am currently not planning to make parallel dumps work with the
custom format even though this would be possible if we changed the
format to a certain degree.
Comments?
Joachim
On Mon, Mar 29, 2010 at 04:46:48PM +0200, Joachim Wieland wrote:
People have been talking about a parallel version of pg_dump a few
times already. I have been working on some proof-of-concept code for
this feature every now and then and I am planning to contribute this
for 9.1.
There are two main issues with a parallel version of pg_dump:
The first one is that it requires a consistent snapshot among
multiple pg_dump clients
Cloning snapshots seems like the way to fix this. I don't know how
far this project has drifted from the PostgreSQL code, but you might
want to look here:
http://pgfoundry.org/projects/snapclone/
and the second is that currently the output goes to a single file
and it is unclear what to do about multiple processes writing into a
single file.
I don't think that's a good idea. Coming up with a directory
structure for the new parallel pg_dump seems like a much better idea.
Andrew, do you have some notes on this?
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <joe@mcknight.de> wrote:
- There are ideas on how to solve the issue with the consistent
snapshot but in the end you can always solve it by stopping your
application(s). I actually assume that whenever people are interested
in a very fast dump, it is because they are doing some maintenance
task (like migrating to a different server) that involves pg_dump. In
these cases, they would stop their system anyway.
Even if we had consistent snapshots in a future version, would we
forbid people to run parallel dumps against old server versions? What
I suggest is to just display a big warning if run against a server
without consistent snapshot support (which currently is every
version).
Seems reasonable.
- Regarding the output of pg_dump I am proposing two solutions. The
first one is to introduce a new archive type "directory" where each
table and each blob is a file in a directory, similar to the
experimental "files" archive type. The idea has also come up that you
should be able to specify multiple directories in order to make use of
several physical disk drives. Taking this further, in order to manage
the mess that you can create with this, every file of the same backup
needs to carry a unique identifier, and pg_restore should have a check
option that tells you whether your backup directory is in a sane and
complete state (think of moving a file from one backup directory to
another, or trying to restore from two directories that belong to
different backup sets...).
I think that specifying several directories is a piece of complexity
that would be best left alone for a first version of this. But a
single directory with multiple files sounds pretty reasonable. Of
course we'll also need to support that format in non-parallel mode,
and in pg_restore.
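For concreteness, driving such a directory format from the command
line might look roughly like this; the -Fd letter, the paths and the
job count are illustrative, not a settled interface:

    pg_dump -Fd -f /backups/mydb.dir mydb            # one file per table/blob
    pg_restore -j 4 -d mydb_copy /backups/mydb.dir   # parallel restore from it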
The second solution to the single-file problem is to generate no
output at all, i.e. whatever you export from your source database you
import directly into your target database, which in the end turns out
to be a parallel form of "pg_dump | psql".
This is a very interesting idea but you might want to get the other
thing merged first, as it's going to present a different set of
issues.
I am currently not planning to make parallel dumps work with the
custom format even though this would be possible if we changed the
format to a certain degree.
I'm thinking we probably don't want to change the existing formats.
...Robert
Robert Haas wrote:
On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <joe@mcknight.de> wrote:
[...]
- Regarding the output of pg_dump I am proposing two solutions. The
first one is to introduce a new archive type "directory" where each
table and each blob is a file in a directory [...]
I think that specifying several directories is a piece of complexity
that would be best left alone for a first version of this. But a
single directory with multiple files sounds pretty reasonable. Of
course we'll also need to support that format in non-parallel mode,
and in pg_restore.
The second solution to the single-file problem is to generate no
output at all, i.e. whatever you export from your source database you
import directly into your target database, which in the end turns out
to be a parallel form of "pg_dump | psql".
This is a very interesting idea but you might want to get the other
thing merged first, as it's going to present a different set of
issues.
I had some prior discussion with Joachim on that (and I suspect I had
some influence on him trying to implement it).
The reason why this is really needed is that the current pg_restore -j
is actually a net loss (vs. "pg_dump | psql") in a LOT of scenarios
that are basically "duplicate this database to that location" (or any
migration really).
The example at hand is a 240GB production database with around 850
tables. It takes ~145min to dump that database single-threaded
(completely CPU bound); simply loading the SQL using psql can restore
it in ~150min (again CPU bound, both for COPY and index creation), and
-j8 brings that down to ~55min.
So if you do the math (and a bit of handwaving):
* using pg_dump | psql you get greatest(145,150) -> 150min.
* using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
* using a theoretical parallel pg_dump and the existing parallel restore
you would get: 50 (just a guess for how fast it might be) + 55 -> 105min
* a parallel dump & restore that can pipeline would end up at
greatest(50,55) -> 55min
So a parallel dump alone would only give you a ~50% speedup in total
time for doing a migration/upgrade/dump-to-devbox, despite the fact
that it uses 8x the resources. A pipelined solution would result in a
~3x speedup in total time, and you don't even have to think about
stuff that might be a problem, like having enough available diskspace
on the source/destination to hold a full temporary dump (if you don't,
you might even have to add some transfer time as well).
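Spelled out as commands (host and database names made up), the
scenarios compared above are roughly:

    # 1: dump and restore overlap, bounded by the slower of the two
    pg_dump proddb | psql -h devbox proddb_copy

    # 2: dump first, then parallel restore; needs disk for the full dump
    pg_dump -Z0 -Fc -f proddb.dump proddb
    pg_restore -j8 -h devbox -d proddb_copy proddb.dump

    # 3 and 4 assume the parallel dump (and a pipelined variant of it)
    # discussed in this thread, which do not exist yet.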
Stefan
On Mon, Mar 29, 2010 at 1:16 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
[...]
So a parallel dump alone would only give you a ~50% speedup in total
time for doing a migration/upgrade/dump-to-devbox despite the fact
that it uses 8x the resources. A pipelined solution would result in a
~3x speedup in total time, and you don't even have to think about
stuff that might be a problem, like having enough available diskspace
on the source/destination to hold a full temporary dump.
It's a great idea - but there are two features here. I've seen many
patches implementing two features during my relatively short time with
the project and if the rejection rate hasn't been 100% it's certainly
been close. If Joachim thinks he's got it all working, by all means
submit both patches. One can apply over the other if they are
interdependent. But I STRONGLY suggest separating this into two
pieces - it is MUCH easier to get things applied that way, for good
and valid reasons.
...Robert
On 3/29/10 7:46 AM, Joachim Wieland wrote:
I actually assume that whenever people are interested
in a very fast dump, it is because they are doing some maintenance
task (like migrating to a different server) that involves pg_dump. In
these cases, they would stop their system anyway.
Actually, I'd say that there's a broad set of cases of people who want
to do a parallel pg_dump while their system is active. Parallel pg_dump
on a stopped system will help some people (for migration, particularly)
but parallel pg_dump with snapshot cloning will help a lot more people.
For example, imagine a user who has a 16-core machine on a 14-drive RAID
10, and a 100-table 1TB database. At 2am, this person might reasonably
want to allocate a large portion of the machine resources to the dump by
giving it 4 threads, without cutting access to the application.
So: if parallel dump in single-user mode is what you can get done, then
do it. We can always improve it later, and we have to start somewhere.
But we will eventually need parallel pg_dump on active systems, and
that should remain on the TODO list.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
On 3/29/10 7:46 AM, Joachim Wieland wrote:
I actually assume that whenever people are interested
in a very fast dump, it is because they are doing some maintenance
task (like migrating to a different server) that involves pg_dump. In
these cases, they would stop their system anyway.
Actually, I'd say that there's a broad set of cases of people who want
to do a parallel pg_dump while their system is active. Parallel pg_dump
on a stopped system will help some people (for migration, particularly)
but parallel pg_dump with snapshot cloning will help a lot more people.
I doubt that. My thought about it is that parallel dump will suck
enough resources from the source server, both disk and CPU, that you
would never want to use it on a live production machine. Not even at
2am. And your proposed use case is hardly a "broad set" in any case.
Thus, Joachim's approach seems perfectly sane from here. I certainly
don't see that there's an argument for spending 10x more development
effort to pick up such use cases.
Another question that's worth asking is exactly what the use case would
be for parallel pg_dump against a live server, whether the snapshots are
synchronized or not. You will not be able to use that dump as a basis
for PITR, so there is no practical way of incorporating any changes that
occur after the dump begins. So what are you making it for? If it's a
routine backup for disaster recovery, fine, but it's not apparent why
you want max speed and to heck with live performance for that purpose.
I think migration to a new server version (that's too incompatible for
PITR or pg_migrate migration) is really the only likely use case.
regards, tom lane
On Mon, Mar 29, 2010 at 4:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[...]
Another question that's worth asking is exactly what the use case
would be for parallel pg_dump against a live server, whether the
snapshots are synchronized or not. [...] I think migration to a new
server version (that's too incompatible for PITR or pg_migrate
migration) is really the only likely use case.
It's completely possible that you could want to clone a server for dev
and have more CPU and I/O bandwidth available than can be efficiently
used by a non-parallel pg_dump. But certainly what Joachim is talking
about will be a good start. I think there is merit to the
synchronized snapshot stuff for pg_dump and perhaps other applications
as well, but I think Joachim's (well-taken) point is that we don't
have to treat it as a hard prerequisite.
...Robert
Robert Haas wrote:
It's completely possible that you could want to clone a server for dev
and have more CPU and I/O bandwidth available than can be efficiently
used by a non-parallel pg_dump. But certainly what Joachim is talking
about will be a good start. I think there is merit to the
synchronized snapshot stuff for pg_dump and perhaps other applications
as well, but I think Joachim's (well-taken) point is that we don't
have to treat it as a hard prerequisite.
Possibly. I think the most useful thing that could be done right now is
probably the least controversial, namely creating a directory type of
archive, with support for pg_restore, including parallel pg_restore.
Personally I think that's worth doing in its own right anyway.
cheers
andrew
Tom Lane wrote:
[...]
My thought about it is that parallel dump will suck enough resources
from the source server, both disk and CPU, that you would never want
to use it on a live production machine. Not even at 2am.
I really doubt that - on fast systems pg_dump is completely CPU
bottlenecked, and typical 1-2U hardware you get these days has 8-16
cores, so simply dedicating a few cores to dumping the database during
quieter times is very realistic.
Databases are growing larger and larger, and the single-threaded
nature of pg_dump makes it very hard to even stay within reasonable
time limits for doing the backup.
Stefan
On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
on fast systems pg_dump is completely CPU bottlenecked
Might be useful to profile why that is. I don't think pg_dump has
historically been developed with CPU efficiency in mind.
On Tue, 30 Mar 2010 13:01:54 +0200, Peter Eisentraut <peter_e@gmx.net>
wrote:
On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
on fast systems pg_dump is completely CPU bottlenecked
Might be useful to profile why that is. I don't think pg_dump has
historically been developed with CPU efficiency in mind.
Already done that - I had posted some WIP patches to speed up COPY;
hopefully I'll have time to finish those one day ;)
Most of the time spent in the backend process during "COPY TO" is in
the datum -> string output functions.
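For anyone who wants to reproduce that, one way to collect such a
profile on Linux is roughly the following (pg_stat_activity column
names as in 8.4/9.0; the function names in the comment are just
typical examples):

    # find the backend that is running the COPY, then sample it
    PID=$(psql -At -c "SELECT procpid FROM pg_stat_activity
                       WHERE current_query LIKE 'COPY%' LIMIT 1" mydb)
    perf record -p "$PID" sleep 60
    perf report   # usually dominated by per-datum output functions,
                  # e.g. timestamptz_out, numeric_out, CopyAttributeOutText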
Peter Eisentraut wrote:
On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
on fast systems pg_dump is completely CPU bottlenecked
Might be useful to profile why that is. I don't think pg_dump has
historically been developed with CPU efficiency in mind.
It's not pg_dump that is the problem - it is COPY that is the limit. In
my specific case, the fact that a lot of the columns are bytea also
adds to the horrible CPU overhead (fixed in 9.0). Still, our bulk load
& unload performance is way slower on a per-core comparison than a lot
of other databases :(
Stefan
On Mar 30, 2010, at 8:15 AM, Stefan Kaltenbrunner wrote:
Peter Eisentraut wrote:
On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
on fast systems pg_dump is completely CPU bottlenecked
Might be useful to profile why that is. I don't think pg_dump has
historically been developed with CPU efficiency in mind.
It's not pg_dump that is the problem - it is COPY that is the limit.
In my specific case, the fact that a lot of the columns are bytea also
adds to the horrible CPU overhead (fixed in 9.0). Still, our bulk load
& unload performance is way slower on a per-core comparison than a lot
of other databases :(
Don't forget the zlib compression used in -Fc (unless you use -Z0)
takes a fair amount of CPU too.
I did some tests and it turned out that -Z0 actually took longer than
-Z1, simply because there was a lot more data to write out, thus I
became I/O bound, not CPU bound.
There's a thing called pigz around that is a parallel gzip
implementation - I wonder how much of that could be adapted to
pg_dump's use, as compression does use a considerable amount of time
(even at -Z1). The biggest problem I can immediately see is that it
uses threads.
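As a point of comparison, you can already get parallel compression
today by keeping compression out of pg_dump entirely and piping
plain-format output through pigz (the thread count here is arbitrary):

    pg_dump -Fp mydb | pigz -p 8 > mydb.sql.gz    # compress on 8 cores
    pigz -dc mydb.sql.gz | psql mydb_copy         # and restore from it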
--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/
Jeff wrote:
[...]
Don't forget the zlib compression used in -Fc (unless you use -Z0)
takes a fair amount of CPU too.
I did some tests and it turned out that -Z0 actually took longer than
-Z1, simply because there was a lot more data to write out, thus I
became I/O bound, not CPU bound.
All my numbers are with -Z0, and it is the backend (COPY and/or index
creation) that is the limit. If you start using compression, you are
shifting the load to pg_dump.
Stefan