Allowing parallel pg_restore from pipe
Hi All,
Currently the -j option to pg_restore, which allows for parallelization in
the restore, can only be used if the input file is a regular file and not,
for example, a pipe. However, this is a pretty common occurrence for us
(usually in the form of pg_dump | pg_restore to copy an individual database
or some tables thereof from one machine to another). While there's no good
way to parallelize the data load steps when reading from a pipe, the index
and constraint building can still be parallelized, and as they are generally
CPU bound on our machines we've found quite a bit of speedup from doing so.
Attached are two diffs off of the REL9_2_4 tag that I've been using. The
first is a simple change that serially loads the data section before
handing off the remainder of the restore to the existing parallelized
restore code (the .ALT. diff). The second, which gets more parallelization
but is a bit more of a change, uses the existing dependency analysis code to
allow index building etc. to occur in parallel with data loading. The data
loading tasks are still performed serially in the main thread, but non-data
loading tasks are scheduled in parallel as their dependencies are satisfied
(with the caveat that the main thread can only dispatch new tasks between
data loads).
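In toy form, the second approach's scheduling loop looks roughly like the sketch below (invented names, not the actual pg_restore code): non-data tasks go to a worker pool as soon as their dependencies are satisfied, while data loads stay serial in the main thread, and new dispatch only happens between them.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def restore(tasks, jobs=4):
    # tasks: dicts with 'name', 'kind' ('data' or anything else), and 'deps'
    # (a set of names). Data loads run serially here in the main thread;
    # everything else is handed to the pool once its dependencies are done.
    done, order, pending, running = set(), [], list(tasks), {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        while pending or running:
            # Dispatch every ready non-data task ("between data loads").
            for t in [t for t in pending
                      if t["kind"] != "data" and t["deps"] <= done]:
                pending.remove(t)
                running[pool.submit(lambda n=t["name"]: n)] = t
            # Serially run the next data load whose dependencies are met.
            data = next((t for t in pending
                         if t["kind"] == "data" and t["deps"] <= done), None)
            if data:
                pending.remove(data)
                done.add(data["name"]); order.append(data["name"])
            elif running:
                # No data load ready: block until some worker finishes.
                finished, _ = wait(list(running), return_when=FIRST_COMPLETED)
                for f in finished:
                    t = running.pop(f)
                    done.add(t["name"]); order.append(t["name"])
            elif pending:
                raise RuntimeError("unsatisfiable dependencies")
            # Reap any workers that finished in the meantime.
            for f in [f for f in list(running) if f.done()]:
                t = running.pop(f)
                done.add(t["name"]); order.append(t["name"])
    return order
```

The point of the sketch is only the ordering constraint: an index or constraint task can overlap with later data loads, but never starts before the table data it depends on has been loaded.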
Anyway, the question is whether people think this is generally useful. If so I
can clean up the preferred choice a bit and rebase it off of master, etc.
Tim
Attachments:
0003-patch-pg_restore-to-allow-parallel-restore-when-the.ALT.patch (+45 -27)
0003-patch-pg_restore-to-allow-parallel-restore-when-the-.patch (+151 -74)
On 04/23/2013 07:53 PM, Timothy Garnett wrote:
[...]
Anyways, the question is if people think this is generally useful. If
so I can clean up the preferred choice a bit and rebase it off of
master, etc.
I don't think these are bad ideas at all, and probably worth doing. Note
that there are some fairly hefty changes affecting this code in master,
so your rebasing could be tricky.
cheers
andrew
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andrew Dunstan <andrew@dunslane.net> writes:
On 04/23/2013 07:53 PM, Timothy Garnett wrote:
Anyways, the question is if people think this is generally useful. If so
I can clean up the preferred choice a bit and rebase it off of master,
etc.
I find this idea very useful, yes.
Another idea would be to allow for parallel pg_dump output to somehow be
piped into a parallel pg_restore. I don't know how to solve that at all;
it just sounds like something worth doing too.
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 04/24/2013 03:40 PM, Dimitri Fontaine wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
[...]
I find this idea very useful yes.
Another idea would be to allow for parallel pg_dump output to somehow be
piped into a parallel pg_restore. I don't know how to solve that at all,
it just sound something worthy of doing too.
That's not going to work; the output from parallel pg_dump is inherently
multiple streams. That's why it ONLY supports directory format, and not
even custom format on disk, let alone a pipe.
cheers
andrew
On 04/24/2013 03:49 PM, Andrew Dunstan wrote:
[...]
That's not going to work, the output from parallel pg_dump is
inherently multiple streams. That's why it ONLY supports directory
format, and not even custom format on disk, let alone a pipe.
What might make sense is something like pg_dump_restore which would have
no intermediate storage at all, just pump the data etc from one source
to another in parallel. But I pity the poor guy who has to write it :-)
cheers
andrew
On 04/24/2013 09:51 PM, Andrew Dunstan wrote:
[...]
What might make sense is something like pg_dump_restore which would have
no intermediate storage at all, just pump the data etc from one source
to another in parallel. But I pity the poor guy who has to write it :-)
Hmm, pretty sure that Joachim's initial patch for parallel dump actually
had a PoC for something very similar to that...
Stefan
On Wed, Apr 24, 2013 at 4:05 PM, Stefan Kaltenbrunner <
stefan@kaltenbrunner.cc> wrote:
[...]
Hmm, pretty sure that Joachim's initial patch for parallel dump actually
had a PoC for something very similar to that...
That's right, I implemented that as its own output format, named
"migrator" I think, which wouldn't write each stream to a file as the
directory output format does but instead pumps it back into a restore
client.
Actually I think the logic was even reversed: it was a parallel restore
that got its data from internally calling pg_dump functionality instead of
from reading files... The neat thing about this approach was that the order
was optimized and correct, i.e. largest tables start first and dependencies
get resolved in the right order.
I could revisit that patch for 9.4 if enough people are interested.
Joachim
On Wed, Apr 24, 2013 at 6:47 PM, Joachim Wieland <joe@mcknight.de> wrote:
[...]
I could revisit that patch for 9.4 if enough people are interested.
Indeed... I've wasted hours copying databases for test environments,
when that could've been an hour (singular).
As the OP, I'll just note that my organization would definitely find use
for a parallel migrator tool, as long as it supported restoring a selection
of tables (i.e. -t / -T) in addition to the whole database, and it supported
(or we were able to patch in) an option to cluster as part of the migration
(the equivalent of something like
https://github.com/tgarnett/postgres/commit/cc320a71 ).
Tim
On Wed, Apr 24, 2013 at 5:47 PM, Joachim Wieland <joe@mcknight.de> wrote:
[...]
I could revisit that patch for 9.4 if enough people are interested.
Joachim
On 04/25/2013 12:56 PM, Timothy Garnett wrote:
As the OP, I'll just note that my organization would definitely find use
for a parallel migrator tool as long as it supported doing a selection
of tables (i.e. -t / -T) in addition to the whole database and it
supported or we were able to patch in an option to cluster as part of
the migration (the equivalent of something like
https://github.com/tgarnett/postgres/commit/cc320a71 ).
If you need something like this short term, we actually found a way to
do it ourselves for a migration we performed back in October. The secret
is xargs with the -P option:
xargs -I{} -P 8 -a table-list.txt \
bash -c "pg_dump -Fc -t {} my_db | pg_restore -h remote -d my_db"
Fill table-list.txt with as many, or as few, tables as you want. The
above example would give you 8 parallel threads. Well-equipped systems
may be able to increase this.
Admittedly it's a gross hack, but it works. :)
--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
______________________________________________
See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
[...]
Admittedly it's a gross hack, but it works. :)
I think you'd have to be real careful around foreign key constraints
for that to work.
Tim
On 05/16/2013 12:16 PM, Timothy Garnett wrote:
I think you'd have to be real careful around foreign key constraints
for that to work.
Not especially. All you need to do is bootstrap the database with a
bunch of empty table targets (no constraints, keys, etc), then restore
with the xargs command. Then you can apply the constraints, keys, and
indexes later by doing a schema-only parallel pg_restore.
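Spelled out end to end, that recipe looks something like the sketch below. It's a dry run that only echoes the commands rather than executing them; my_db, remote, and table-list.txt are the placeholder names from above, and the --section=pre-data/post-data split is how 9.2-era pg_dump/pg_restore separate table definitions from keys, constraints, and indexes.

```shell
DB=my_db; HOST=remote; JOBS=8

# Phase 1: bootstrap empty tables only -- no keys, constraints, or indexes.
PHASE1="pg_dump --schema-only --section=pre-data $DB | psql -h $HOST -d $DB"

# Phase 2: copy table data in parallel with xargs -P (-a = data only).
PHASE2="xargs -I{} -P $JOBS -a table-list.txt sh -c 'pg_dump -Fc -a -t {} $DB | pg_restore -h $HOST -d $DB'"

# Phase 3: apply keys, constraints, and indexes with a schema-only parallel
# restore (pg_restore -j needs a regular file, so dump to disk first).
PHASE3="pg_dump -Fc --schema-only --section=post-data $DB > post.dump && pg_restore -j $JOBS --section=post-data -h $HOST -d $DB post.dump"

echo "$PHASE1"; echo "$PHASE2"; echo "$PHASE3"
```

Since phase 2 loads into constraint-free tables, load order between tables doesn't matter; the foreign keys only have to validate once, in phase 3.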
--
Shaun Thomas
On Wed, Apr 24, 2013 at 03:33:42PM -0400, Andrew Dunstan wrote:
On 04/23/2013 07:53 PM, Timothy Garnett wrote:
[...]
I don't think these are bad ideas at all, and probably worth doing.
Note that there are some fairly hefty changes affecting this code in
master, so your rebasing could be tricky.
Is there any progress on this: doing parallel pg_restore from a pipe?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, Dec 3, 2013 at 12:14 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Apr 24, 2013 at 03:33:42PM -0400, Andrew Dunstan wrote:
On 04/23/2013 07:53 PM, Timothy Garnett wrote:
[...]
Is there any progress on this: doing parallel pg_restore from a pipe?
We're on 9.2 and making extensive use of the patch in the original
post. I will probably forward-port it to 9.3 when we migrate to that
(probably sometime Q1), since we pretty much require it in our setup.
Tim