pg_dump additional options for performance
pg_dump allows you to specify -s --schema-only, or -a --data-only.
The -s option creates the table, as well as creating constraints and
indexes. These objects need to be dropped prior to loading, if we are to
follow the performance recommendations in the docs. But the only way to
do that is to manually edit the script to produce a cut down script.
So it would be good if we could dump objects in 3 groups
1. all commands required to re-create table
2. data
3. all commands required to complete table after data load
My proposal is to provide two additional modes:
--schema-pre-load corresponding to (1) above
--schema-post-load corresponding to (3) above
This would then allow this sequence of commands
pg_dump --schema-pre-load
pg_dump --data-only
pg_dump --schema-post-load
to be logically equivalent, but faster than
pg_dump --schema-only
pg_dump --data-only
both forms of which are equivalent to just
pg_dump
[Assuming data isn't changing between invocations...]
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Wed, Feb 06, 2008 at 03:13:24PM +0000, Simon Riggs wrote:
pg_dump allows you to specify -s --schema-only, or -a --data-only.
The -s option creates the table, as well as creating constraints and
indexes. These objects need to be dropped prior to loading, if we are to
follow the performance recommendations in the docs. But the only way to
do that is to manually edit the script to produce a cut down script.So it would be good if we could dump objects in 3 groups
1. all commands required to re-create table
2. data
3. all commands required to complete table after data loadMy proposal is to provide two additional modes:
--schema-pre-load corresponding to (1) above
--schema-post-load corresponding to (3) aboveThis would then allow this sequence of commands
pg_dump --schema-pre-load
pg_dump --data-only
pg_dump --schema-post-loadto be logically equivalent, but faster than
pg_dump --schema-only
pg_dump --data-onlyboth forms of which are equivalent to just
pg_dump
[Assuming data isn't changing between invocations...]
I've been considering just this. Another otpion I came up with was a more
generic switch where you'd have:
pg_dump --what=tables
pg_dump --what=indexes,constraints
or something like that. Would give more flexibility, but I'm not sure if
that's worthwhile.
Having the ability to just this filtering that you're talking about would
be very handy - I've needed it more than once.
Where would you load primary keys and such contrants? Pre- or post dump? I
think the case could be made for either one...
//Magnus
Simon Riggs wrote:
My proposal is to provide two additional modes:
--schema-pre-load corresponding to (1) above
--schema-post-load corresponding to (3) aboveThis would then allow this sequence of commands
pg_dump --schema-pre-load
pg_dump --data-only
pg_dump --schema-post-loadto be logically equivalent, but faster than
pg_dump --schema-only
pg_dump --data-only
It would actually be better, as well as faster, because it will be
guaranteed to work :-) There are known cases where schema-only followed
by data-only fails.
One more note: we need to make sure the corresponding new modes are also
added to pg_restore.
cheers
andrew
On Wed, 2008-02-06 at 16:41 +0100, Magnus Hagander wrote:
Where would you load primary keys and such contrants? Pre- or post dump? I
think the case could be made for either one...
Post dump. If the constraint was successfully in place when we performed
the dump then it should work successfully after the load.
I like your syntax/idea as well. I think that would be worth doing in
addition for when you want fine grained control. (I'd add the ability to
dump any named object, allowing you to dump individual indexes,
functions, etc..)
I don't think the two syntaxes compete. I want to be able to say
"everything before", "data" and "everything after" without having to
remember to specify --what=tables,views,sequences,kitchensink etc.
Especially since you probably won't notice you've missed out an object
type (e.g. sequences) until you try to run an application. Too late!
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon wrote:
My proposal is to provide two additional modes:
--schema-pre-load corresponding to (1) above
--schema-post-load corresponding to (3) above
Sounds nice.
For a large schema we might rather want one switch that dumps 2 files,
no ?
Probably also better from a mvcc perspective.
Andreas
Simon Riggs wrote:
On Wed, 2008-02-06 at 16:41 +0100, Magnus Hagander wrote:
I don't think the two syntaxes compete. I want to be able to say
"everything before", "data" and "everything after" without having to
remember to specify --what=tables,views,sequences,kitchensink etc.
Especially since you probably won't notice you've missed out an object
type (e.g. sequences) until you try to run an application. Too late!
Is this not a job for the -l / -L mode of pg_restore, but ported to pg_dump?
With a bit of tweaking ti the format you could do something like:
pg_dump -l mydb | grep BEFORE > obj_list.txt
pg_dump -L obj_list.txt mydb > mydb.before.schema
--
Richard Huxton
Archonet Ltd
On Wed, Feb 06, 2008 at 05:10:00PM +0100, Zeugswetter Andreas ADI SD wrote:
Simon wrote:
My proposal is to provide two additional modes:
--schema-pre-load corresponding to (1) above
--schema-post-load corresponding to (3) aboveSounds nice.
For a large schema we might rather want one switch that dumps 2 files,
no ?
Probably also better from a mvcc perspective.
+1
For that matter, it'd be better if you could just get all 3 files (pre,
data, post) in one shot with one transaction; that would guarantee you a
clean dump.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Decibel! wrote:
For that matter, it'd be better if you could just get all 3 files (pre,
data, post) in one shot with one transaction; that would guarantee you a
clean dump.
+1.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, 2008-02-08 at 17:01 -0300, Alvaro Herrera wrote:
Decibel! wrote:
For that matter, it'd be better if you could just get all 3 files (pre,
data, post) in one shot with one transaction; that would guarantee you a
clean dump.+1.
OK, that seems possible.
What do you think the syntax should be for that?
If you specify --data-only you get 1 file
If you specify --schema-only you get either 1 or 2 files
and if you don't specify either you get either 1 or 3 files
How do we name the files?
--pre-schema-file
--data-file
--post-schema-file
If any of the above 3 files are named then --file is treated as an
error.
That allows you to do this:
* dump all 3 files in one go
pg_dump --pre-schema-file=pre_schema.dmp --data-file=data.dmp
--post-schema-file=post_schema.dmp
* dump 2 files for schema
pg_dump --pre-schema-file=pre_schema.dmp
--post-schema-file=post_schema.dmp --schema-only
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
What do you think the syntax should be for that?
If you specify --data-only you get 1 file
If you specify --schema-only you get either 1 or 2 files
and if you don't specify either you get either 1 or 3 filesHow do we name the files?
--pre-schema-file
--data-file
--post-schema-file
I think this is pretty unwieldy. Perhaps we could have a parameter that
received a single file name and appended or prepended something to it to
create the three file names.
--multidump-prefix=foobar
and it creates foobar.1.predata, foobar.2.data, foobar.3.postdata
or something like that? The number would help to sort them
appropriately, and the string would ensure that you know what each file
is ... perhaps we could have %-escapes in the name to expand to both of
these? Perhaps we could have other %-escapes for things like database
name --- so you could say --multidump-filename=%d.%n.%t.dump ... but
then it would be nice to have strftime escapes too.
Or is this too complex?
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote:
--multidump-prefix=foobar
and it creates foobar.1.predata, foobar.2.data, foobar.3.postdataor something like that? The number would help to sort them
appropriately, and the string would ensure that you know what each file
is ... perhaps we could have %-escapes in the name to expand to both of
these? Perhaps we could have other %-escapes for things like database
name --- so you could say --multidump-filename=%d.%n.%t.dump ... but
then it would be nice to have strftime escapes too.Or is this too complex?
Yes, I think it is. We do not have to be infinitely flexible. KISS seems
apposite.
cheers
andrew
On Mon, 2008-02-11 at 10:29 -0500, Andrew Dunstan wrote:
Alvaro Herrera wrote:
--multidump-prefix=foobar
and it creates foobar.1.predata, foobar.2.data, foobar.3.postdataor something like that? The number would help to sort them
appropriately, and the string would ensure that you know what each file
is ... perhaps we could have %-escapes in the name to expand to both of
these? Perhaps we could have other %-escapes for things like database
name --- so you could say --multidump-filename=%d.%n.%t.dump ... but
then it would be nice to have strftime escapes too.Or is this too complex?
Yes, I think it is. We do not have to be infinitely flexible. KISS seems
apposite.
What syntax do you suggest?
How about we use the --file as the prefix?
and just use a postfix of .1 and .2 and .3
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
On Mon, 2008-02-11 at 10:29 -0500, Andrew Dunstan wrote:
Alvaro Herrera wrote:
--multidump-prefix=foobar
and it creates foobar.1.predata, foobar.2.data, foobar.3.postdataor something like that? The number would help to sort them
appropriately, and the string would ensure that you know what each file
is ... perhaps we could have %-escapes in the name to expand to both of
these? Perhaps we could have other %-escapes for things like database
name --- so you could say --multidump-filename=%d.%n.%t.dump ... but
then it would be nice to have strftime escapes too.Or is this too complex?
Yes, I think it is. We do not have to be infinitely flexible. KISS seems
apposite.What syntax do you suggest?
How about we use the --file as the prefix?
and just use a postfix of .1 and .2 and .3
seems reasonable.
cheers
andrew
Alvaro Herrera <alvherre@commandprompt.com> writes:
I think this is pretty unwieldy.
I agree. Since any multiple-output-file case can't usefully use stdout,
I think we should combine the switches and just have one switch that
says both that you want separated output and what the target filename
is. Thus something like
--pre-schema-file = foo
--data-file = bar
--post-schema-file = baz
where specifying any of these suppresses the "normal" output to stdout.
So, if you give just a subset of them, you get just subset output.
With this design, --schema-only, --data-only, and --file are obsolete,
and we should probably throw an error if any of them are used in
combination with these switches.
regards, tom lane
On Mon, 2008-02-11 at 11:51 -0500, Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
I think this is pretty unwieldy.
I agree. Since any multiple-output-file case can't usefully use stdout,
I think we should combine the switches and just have one switch that
says both that you want separated output and what the target filename
is. Thus something like--pre-schema-file = foo
--data-file = bar
--post-schema-file = bazwhere specifying any of these suppresses the "normal" output to stdout.
So, if you give just a subset of them, you get just subset output.With this design, --schema-only, --data-only, and --file are obsolete,
and we should probably throw an error if any of them are used in
combination with these switches.
Looks good from here.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
On Mon, 2008-02-11 at 11:51 -0500, Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
I think this is pretty unwieldy.
I agree. Since any multiple-output-file case can't usefully use stdout,
I think we should combine the switches and just have one switch that
says both that you want separated output and what the target filename
is. Thus something like--pre-schema-file = foo
--data-file = bar
--post-schema-file = bazwhere specifying any of these suppresses the "normal" output to stdout.
So, if you give just a subset of them, you get just subset output.With this design, --schema-only, --data-only, and --file are obsolete,
and we should probably throw an error if any of them are used in
combination with these switches.Looks good from here.
+1
Simon, are you planning to do this? (assuming everybody agrees on the
syntax)
//Magnus
On Mon, 2008-02-11 at 18:15 +0100, Magnus Hagander wrote:
Simon, are you planning to do this? (assuming everybody agrees on the
syntax)
Yes, will do.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
I think this is pretty unwieldy.
I agree. Since any multiple-output-file case can't usefully use stdout,
I think we should combine the switches and just have one switch that
says both that you want separated output and what the target filename
is. Thus something like--pre-schema-file = foo
--data-file = bar
--post-schema-file = bazwhere specifying any of these suppresses the "normal" output to stdout.
So, if you give just a subset of them, you get just subset output.With this design, --schema-only, --data-only, and --file are obsolete,
and we should probably throw an error if any of them are used in
combination with these switches.
this looks good. But arguably it's really pre-data and post-data.
pre-schema would be something that comes before the schema, no? Or maybe
it should be {pre,post}-data-schema-file.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
--pre-schema-file = foo
--data-file = bar
--post-schema-file = baz
this looks good. But arguably it's really pre-data and post-data.
pre-schema would be something that comes before the schema, no? Or maybe
it should be {pre,post}-data-schema-file.
I was thinking that the names were a bit long, but I'm not sure we can
get away with omitting "-file". --pre-file, --data-file, --post-file
are nice and short, but perhaps fail to convey the meaning?
regards, tom lane
On Feb 11, 2008 3:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, 2008-02-11 at 10:29 -0500, Andrew Dunstan wrote:
Alvaro Herrera wrote:
--multidump-prefix=foobar
and it creates foobar.1.predata, foobar.2.data, foobar.3.postdataor something like that? The number would help to sort them
appropriately, and the string would ensure that you know what each file
is ... perhaps we could have %-escapes in the name to expand to both of
these? Perhaps we could have other %-escapes for things like database
name --- so you could say --multidump-filename=%d.%n.%t.dump ... but
then it would be nice to have strftime escapes too.Or is this too complex?
Yes, I think it is. We do not have to be infinitely flexible. KISS seems
apposite.What syntax do you suggest?
How about we use the --file as the prefix?
and just use a postfix of .1 and .2 and .3
It seems better to me to have a suffix that suggests some sort of
meaning. I'm not sure of the ideal names, but starting with:
.pre-schema, .data, and .post-schema as possibilities seems like a
route to get to possibly-better names...
--
http://linuxfinances.info/info/linuxdistributions.html
"The definition of insanity is doing the same thing over and over and
expecting different results." -- assortedly attributed to Albert
Einstein, Benjamin Franklin, Rita Mae Brown, and Rudyard Kipling