pg_dump additional options for performance

Started by Simon Riggsabout 18 years ago86 messageshackers
Jump to latest
#1Simon Riggs
simon@2ndQuadrant.com

pg_dump allows you to specify -s --schema-only, or -a --data-only.

The -s option creates the table, as well as creating constraints and
indexes. These objects need to be dropped prior to loading, if we are to
follow the performance recommendations in the docs. But the only way to
do that is to manually edit the script to produce a cut down script.

So it would be good if we could dump objects in 3 groups
1. all commands required to re-create table
2. data
3. all commands required to complete table after data load

My proposal is to provide two additional modes:
--schema-pre-load corresponding to (1) above
--schema-post-load corresponding to (3) above

This would then allow this sequence of commands

pg_dump --schema-pre-load
pg_dump --data-only
pg_dump --schema-post-load

to be logically equivalent, but faster than

pg_dump --schema-only
pg_dump --data-only

both forms of which are equivalent to just

pg_dump

[Assuming data isn't changing between invocations...]

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#2Magnus Hagander
magnus@hagander.net
In reply to: Simon Riggs (#1)
Re: pg_dump additional options for performance

On Wed, Feb 06, 2008 at 03:13:24PM +0000, Simon Riggs wrote:

pg_dump allows you to specify -s --schema-only, or -a --data-only.

The -s option creates the table, as well as creating constraints and
indexes. These objects need to be dropped prior to loading, if we are to
follow the performance recommendations in the docs. But the only way to
do that is to manually edit the script to produce a cut down script.

So it would be good if we could dump objects in 3 groups
1. all commands required to re-create table
2. data
3. all commands required to complete table after data load

My proposal is to provide two additional modes:
--schema-pre-load corresponding to (1) above
--schema-post-load corresponding to (3) above

This would then allow this sequence of commands

pg_dump --schema-pre-load
pg_dump --data-only
pg_dump --schema-post-load

to be logically equivalent, but faster than

pg_dump --schema-only
pg_dump --data-only

both forms of which are equivalent to just

pg_dump

[Assuming data isn't changing between invocations...]

I've been considering just this. Another otpion I came up with was a more
generic switch where you'd have:
pg_dump --what=tables
pg_dump --what=indexes,constraints

or something like that. Would give more flexibility, but I'm not sure if
that's worthwhile.

Having the ability to just this filtering that you're talking about would
be very handy - I've needed it more than once.

Where would you load primary keys and such contrants? Pre- or post dump? I
think the case could be made for either one...

//Magnus

#3Andrew Dunstan
andrew@dunslane.net
In reply to: Simon Riggs (#1)
Re: pg_dump additional options for performance

Simon Riggs wrote:

My proposal is to provide two additional modes:
--schema-pre-load corresponding to (1) above
--schema-post-load corresponding to (3) above

This would then allow this sequence of commands

pg_dump --schema-pre-load
pg_dump --data-only
pg_dump --schema-post-load

to be logically equivalent, but faster than

pg_dump --schema-only
pg_dump --data-only

It would actually be better, as well as faster, because it will be
guaranteed to work :-) There are known cases where schema-only followed
by data-only fails.

One more note: we need to make sure the corresponding new modes are also
added to pg_restore.

cheers

andrew

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Magnus Hagander (#2)
Re: pg_dump additional options for performance

On Wed, 2008-02-06 at 16:41 +0100, Magnus Hagander wrote:

Where would you load primary keys and such contrants? Pre- or post dump? I
think the case could be made for either one...

Post dump. If the constraint was successfully in place when we performed
the dump then it should work successfully after the load.

I like your syntax/idea as well. I think that would be worth doing in
addition for when you want fine grained control. (I'd add the ability to
dump any named object, allowing you to dump individual indexes,
functions, etc..)

I don't think the two syntaxes compete. I want to be able to say
"everything before", "data" and "everything after" without having to
remember to specify --what=tables,views,sequences,kitchensink etc.
Especially since you probably won't notice you've missed out an object
type (e.g. sequences) until you try to run an application. Too late!

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#5Zeugswetter Andreas ADI SD
Andreas.Zeugswetter@s-itsolutions.at
In reply to: Simon Riggs (#4)
Re: pg_dump additional options for performance

Simon wrote:

My proposal is to provide two additional modes:
--schema-pre-load corresponding to (1) above
--schema-post-load corresponding to (3) above

Sounds nice.
For a large schema we might rather want one switch that dumps 2 files,
no ?
Probably also better from a mvcc perspective.

Andreas

#6Richard Huxton
dev@archonet.com
In reply to: Simon Riggs (#4)
Re: pg_dump additional options for performance

Simon Riggs wrote:

On Wed, 2008-02-06 at 16:41 +0100, Magnus Hagander wrote:

I don't think the two syntaxes compete. I want to be able to say
"everything before", "data" and "everything after" without having to
remember to specify --what=tables,views,sequences,kitchensink etc.
Especially since you probably won't notice you've missed out an object
type (e.g. sequences) until you try to run an application. Too late!

Is this not a job for the -l / -L mode of pg_restore, but ported to pg_dump?

With a bit of tweaking ti the format you could do something like:

pg_dump -l mydb | grep BEFORE > obj_list.txt
pg_dump -L obj_list.txt mydb > mydb.before.schema

--
Richard Huxton
Archonet Ltd

#7Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Zeugswetter Andreas ADI SD (#5)
Re: pg_dump additional options for performance

On Wed, Feb 06, 2008 at 05:10:00PM +0100, Zeugswetter Andreas ADI SD wrote:

Simon wrote:

My proposal is to provide two additional modes:
--schema-pre-load corresponding to (1) above
--schema-post-load corresponding to (3) above

Sounds nice.
For a large schema we might rather want one switch that dumps 2 files,
no ?
Probably also better from a mvcc perspective.

+1

For that matter, it'd be better if you could just get all 3 files (pre,
data, post) in one shot with one transaction; that would guarantee you a
clean dump.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jim Nasby (#7)
Re: pg_dump additional options for performance

Decibel! wrote:

For that matter, it'd be better if you could just get all 3 files (pre,
data, post) in one shot with one transaction; that would guarantee you a
clean dump.

+1.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#8)
Re: pg_dump additional options for performance

On Fri, 2008-02-08 at 17:01 -0300, Alvaro Herrera wrote:

Decibel! wrote:

For that matter, it'd be better if you could just get all 3 files (pre,
data, post) in one shot with one transaction; that would guarantee you a
clean dump.

+1.

OK, that seems possible.

What do you think the syntax should be for that?

If you specify --data-only you get 1 file
If you specify --schema-only you get either 1 or 2 files
and if you don't specify either you get either 1 or 3 files

How do we name the files?

--pre-schema-file
--data-file
--post-schema-file

If any of the above 3 files are named then --file is treated as an
error.

That allows you to do this:

* dump all 3 files in one go

pg_dump --pre-schema-file=pre_schema.dmp --data-file=data.dmp
--post-schema-file=post_schema.dmp

* dump 2 files for schema

pg_dump --pre-schema-file=pre_schema.dmp
--post-schema-file=post_schema.dmp --schema-only

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#9)
Re: pg_dump additional options for performance

Simon Riggs wrote:

What do you think the syntax should be for that?

If you specify --data-only you get 1 file
If you specify --schema-only you get either 1 or 2 files
and if you don't specify either you get either 1 or 3 files

How do we name the files?

--pre-schema-file
--data-file
--post-schema-file

I think this is pretty unwieldy. Perhaps we could have a parameter that
received a single file name and appended or prepended something to it to
create the three file names.

--multidump-prefix=foobar
and it creates foobar.1.predata, foobar.2.data, foobar.3.postdata

or something like that? The number would help to sort them
appropriately, and the string would ensure that you know what each file
is ... perhaps we could have %-escapes in the name to expand to both of
these? Perhaps we could have other %-escapes for things like database
name --- so you could say --multidump-filename=%d.%n.%t.dump ... but
then it would be nice to have strftime escapes too.

Or is this too complex?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#11Andrew Dunstan
andrew@dunslane.net
In reply to: Alvaro Herrera (#10)
Re: pg_dump additional options for performance

Alvaro Herrera wrote:

--multidump-prefix=foobar
and it creates foobar.1.predata, foobar.2.data, foobar.3.postdata

or something like that? The number would help to sort them
appropriately, and the string would ensure that you know what each file
is ... perhaps we could have %-escapes in the name to expand to both of
these? Perhaps we could have other %-escapes for things like database
name --- so you could say --multidump-filename=%d.%n.%t.dump ... but
then it would be nice to have strftime escapes too.

Or is this too complex?

Yes, I think it is. We do not have to be infinitely flexible. KISS seems
apposite.

cheers

andrew

#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Andrew Dunstan (#11)
Re: pg_dump additional options for performance

On Mon, 2008-02-11 at 10:29 -0500, Andrew Dunstan wrote:

Alvaro Herrera wrote:

--multidump-prefix=foobar
and it creates foobar.1.predata, foobar.2.data, foobar.3.postdata

or something like that? The number would help to sort them
appropriately, and the string would ensure that you know what each file
is ... perhaps we could have %-escapes in the name to expand to both of
these? Perhaps we could have other %-escapes for things like database
name --- so you could say --multidump-filename=%d.%n.%t.dump ... but
then it would be nice to have strftime escapes too.

Or is this too complex?

Yes, I think it is. We do not have to be infinitely flexible. KISS seems
apposite.

What syntax do you suggest?

How about we use the --file as the prefix?
and just use a postfix of .1 and .2 and .3

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Simon Riggs (#12)
Re: pg_dump additional options for performance

Simon Riggs wrote:

On Mon, 2008-02-11 at 10:29 -0500, Andrew Dunstan wrote:

Alvaro Herrera wrote:

--multidump-prefix=foobar
and it creates foobar.1.predata, foobar.2.data, foobar.3.postdata

or something like that? The number would help to sort them
appropriately, and the string would ensure that you know what each file
is ... perhaps we could have %-escapes in the name to expand to both of
these? Perhaps we could have other %-escapes for things like database
name --- so you could say --multidump-filename=%d.%n.%t.dump ... but
then it would be nice to have strftime escapes too.

Or is this too complex?

Yes, I think it is. We do not have to be infinitely flexible. KISS seems
apposite.

What syntax do you suggest?

How about we use the --file as the prefix?
and just use a postfix of .1 and .2 and .3

seems reasonable.

cheers

andrew

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#10)
Re: pg_dump additional options for performance

Alvaro Herrera <alvherre@commandprompt.com> writes:

I think this is pretty unwieldy.

I agree. Since any multiple-output-file case can't usefully use stdout,
I think we should combine the switches and just have one switch that
says both that you want separated output and what the target filename
is. Thus something like

--pre-schema-file = foo
--data-file = bar
--post-schema-file = baz

where specifying any of these suppresses the "normal" output to stdout.
So, if you give just a subset of them, you get just subset output.

With this design, --schema-only, --data-only, and --file are obsolete,
and we should probably throw an error if any of them are used in
combination with these switches.

regards, tom lane

#15Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#14)
Re: pg_dump additional options for performance

On Mon, 2008-02-11 at 11:51 -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

I think this is pretty unwieldy.

I agree. Since any multiple-output-file case can't usefully use stdout,
I think we should combine the switches and just have one switch that
says both that you want separated output and what the target filename
is. Thus something like

--pre-schema-file = foo
--data-file = bar
--post-schema-file = baz

where specifying any of these suppresses the "normal" output to stdout.
So, if you give just a subset of them, you get just subset output.

With this design, --schema-only, --data-only, and --file are obsolete,
and we should probably throw an error if any of them are used in
combination with these switches.

Looks good from here.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#16Magnus Hagander
magnus@hagander.net
In reply to: Simon Riggs (#15)
Re: pg_dump additional options for performance

Simon Riggs wrote:

On Mon, 2008-02-11 at 11:51 -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

I think this is pretty unwieldy.

I agree. Since any multiple-output-file case can't usefully use stdout,
I think we should combine the switches and just have one switch that
says both that you want separated output and what the target filename
is. Thus something like

--pre-schema-file = foo
--data-file = bar
--post-schema-file = baz

where specifying any of these suppresses the "normal" output to stdout.
So, if you give just a subset of them, you get just subset output.

With this design, --schema-only, --data-only, and --file are obsolete,
and we should probably throw an error if any of them are used in
combination with these switches.

Looks good from here.

+1

Simon, are you planning to do this? (assuming everybody agrees on the
syntax)

//Magnus

#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Magnus Hagander (#16)
Re: pg_dump additional options for performance

On Mon, 2008-02-11 at 18:15 +0100, Magnus Hagander wrote:

Simon, are you planning to do this? (assuming everybody agrees on the
syntax)

Yes, will do.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#18Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#14)
Re: pg_dump additional options for performance

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

I think this is pretty unwieldy.

I agree. Since any multiple-output-file case can't usefully use stdout,
I think we should combine the switches and just have one switch that
says both that you want separated output and what the target filename
is. Thus something like

--pre-schema-file = foo
--data-file = bar
--post-schema-file = baz

where specifying any of these suppresses the "normal" output to stdout.
So, if you give just a subset of them, you get just subset output.

With this design, --schema-only, --data-only, and --file are obsolete,
and we should probably throw an error if any of them are used in
combination with these switches.

this looks good. But arguably it's really pre-data and post-data.
pre-schema would be something that comes before the schema, no? Or maybe
it should be {pre,post}-data-schema-file.

cheers

andrew

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#18)
Re: pg_dump additional options for performance

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

--pre-schema-file = foo
--data-file = bar
--post-schema-file = baz

this looks good. But arguably it's really pre-data and post-data.
pre-schema would be something that comes before the schema, no? Or maybe
it should be {pre,post}-data-schema-file.

I was thinking that the names were a bit long, but I'm not sure we can
get away with omitting "-file". --pre-file, --data-file, --post-file
are nice and short, but perhaps fail to convey the meaning?

regards, tom lane

#20Chris Browne
cbbrowne@acm.org
In reply to: Simon Riggs (#12)
Re: pg_dump additional options for performance

On Feb 11, 2008 3:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 2008-02-11 at 10:29 -0500, Andrew Dunstan wrote:

Alvaro Herrera wrote:

--multidump-prefix=foobar
and it creates foobar.1.predata, foobar.2.data, foobar.3.postdata

or something like that? The number would help to sort them
appropriately, and the string would ensure that you know what each file
is ... perhaps we could have %-escapes in the name to expand to both of
these? Perhaps we could have other %-escapes for things like database
name --- so you could say --multidump-filename=%d.%n.%t.dump ... but
then it would be nice to have strftime escapes too.

Or is this too complex?

Yes, I think it is. We do not have to be infinitely flexible. KISS seems
apposite.

What syntax do you suggest?

How about we use the --file as the prefix?
and just use a postfix of .1 and .2 and .3

It seems better to me to have a suffix that suggests some sort of
meaning. I'm not sure of the ideal names, but starting with:
.pre-schema, .data, and .post-schema as possibilities seems like a
route to get to possibly-better names...

--
http://linuxfinances.info/info/linuxdistributions.html
"The definition of insanity is doing the same thing over and over and
expecting different results." -- assortedly attributed to Albert
Einstein, Benjamin Franklin, Rita Mae Brown, and Rudyard Kipling

#21Jeff Davis
pgsql@j-davis.com
In reply to: Zeugswetter Andreas ADI SD (#5)
#22Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Chris Browne (#20)
#23Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#1)
#24Jochem van Dieten
jochemd@gmail.com
In reply to: Tom Lane (#14)
#25Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Jochem van Dieten (#24)
#26Tom Dunstan
pgsql@tomd.cc
In reply to: Jochem van Dieten (#24)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#1)
#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#27)
#29Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#23)
#30Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#27)
#31Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Simon Riggs (#28)
#32Simon Riggs
simon@2ndQuadrant.com
In reply to: Dimitri Fontaine (#31)
#33Tom Dunstan
pgsql@tomd.cc
In reply to: Simon Riggs (#32)
#34Zeugswetter Andreas ADI SD
Andreas.Zeugswetter@s-itsolutions.at
In reply to: Simon Riggs (#32)
#35Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Dunstan (#33)
#36Andrew Dunstan
andrew@dunslane.net
In reply to: Simon Riggs (#35)
#37Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Tom Dunstan (#33)
#38Magnus Hagander
magnus@hagander.net
In reply to: Andrew Dunstan (#36)
#39Simon Riggs
simon@2ndQuadrant.com
In reply to: Andrew Dunstan (#36)
#40Simon Riggs
simon@2ndQuadrant.com
In reply to: Magnus Hagander (#38)
#41Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Magnus Hagander (#38)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#30)
#43Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#42)
#44Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#42)
#45David BOURIAUD
david.bouriaud@ac-rouen.fr
In reply to: Tom Lane (#42)
#46Magnus Hagander
magnus@hagander.net
In reply to: Simon Riggs (#44)
#47Simon Riggs
simon@2ndQuadrant.com
In reply to: Magnus Hagander (#46)
#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#29)
#49Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#44)
#50Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#48)
#51Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#49)
#52Greg Smith
gsmith@gregsmith.com
In reply to: Simon Riggs (#39)
#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#52)
#54Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Smith (#52)
#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#50)
#56Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#55)
#57Jeff Davis
pgsql@j-davis.com
In reply to: Simon Riggs (#50)
#58Joshua D. Drake
jd@commandprompt.com
In reply to: Jeff Davis (#57)
#59Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joshua D. Drake (#58)
#60Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#59)
#61Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#53)
#62Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#59)
#63Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#55)
#64Joshua D. Drake
jd@commandprompt.com
In reply to: Greg Smith (#61)
#65Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#62)
#66Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#55)
#67Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joshua D. Drake (#60)
#68Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#55)
#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#68)
#70Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#69)
#71Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#68)
#72Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Tom Lane (#67)
#73Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Joshua D. Drake (#64)
#74Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#69)
#75Bruce Momjian
bruce@momjian.us
In reply to: Jim Nasby (#74)
#76Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#55)
#77Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#49)
#78Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#76)
#79Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#78)
#80Joshua D. Drake
jd@commandprompt.com
In reply to: Bruce Momjian (#76)
#81Bruce Momjian
bruce@momjian.us
In reply to: Joshua D. Drake (#80)
#82Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#81)
#83Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#75)
#84Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#1)
#85Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#84)
#86Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#85)