pg_restore --no-post-data and --post-data-only
Attached is an undocumented patch that allows pg_restore to omit
post-data items or omit all but post-data items. This has been discussed
before, and Simon sent in a patch back on 2008, which has bitrotted
some. I'm not sure why it was dropped at the time, but I think it's time
to do this. This patch relies on some infrastructure that was added
since Simon's patch, so it works a bit differently (and more simply).
So with this patch, the following three sequences should be equivalent:
pg_restore --no-post-data
pg_restore --post-data-only
pg_restore -s --no-post-data
pg_restore -a
pg_restore --post-data-only
pg_restore
This is useful and worth doing on its own, and will also add to the
usefulness of the pg_dump --exclude-table-data patch in my previous email.
As with that patch, a version that applies to version 9.0 and 8.4
sources is also attached, for the very eager.
cheers
andrew
On 8/23/11 1:30 PM, Andrew Dunstan wrote:
Attached is an undocumented patch that allows pg_restore to omit
post-data items or omit all but post-data items. This has been discussed
before, and Simon sent in a patch back on 2008, which has bitrotted
some. I'm not sure why it was dropped at the time, but I think it's time
to do this. This patch relies on some infrastructure that was added
since Simon's patch, so it works a bit differently (and more simply).
If it's not clear from Andrew's description, the purpose of this patch
is to allow dividing your pgdump into 3 portions:
1. schema
2. data
3. constraints/indexes
This allows users to implement a number of custom solutions for ad-hoc
parallel dump, conditional loading, data munging and sampled databases.
While doing so was possible before using the manifest from pg_restore
-l, the manifest approach has been complex to automate and relies on
obscure knowledge.
I have immediate production use for this patch and may be backporting it.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 08/24/2011 08:43 PM, Josh Berkus wrote:
On 8/23/11 1:30 PM, Andrew Dunstan wrote:
Attached is an undocumented patch that allows pg_restore to omit
post-data items or omit all but post-data items. This has been discussed
before, and Simon sent in a patch back on 2008, which has bitrotted
some. I'm not sure why it was dropped at the time, but I think it's time
to do this. This patch relies on some infrastructure that was added
since Simon's patch, so it works a bit differently (and more simply).If it's not clear from Andrew's description, the purpose of this patch
is to allow dividing your pgdump into 3 portions:1. schema
2. data
3. constraints/indexesThis allows users to implement a number of custom solutions for ad-hoc
parallel dump, conditional loading, data munging and sampled databases.
While doing so was possible before using the manifest from pg_restore
-l, the manifest approach has been complex to automate and relies on
obscure knowledge.I have immediate production use for this patch and may be backporting it.
It's already backported, at least as far as 8.4. Check your email :-)
cheers
andrew
On Aug 24, 2011, at 7:43 PM, Josh Berkus wrote:
On 8/23/11 1:30 PM, Andrew Dunstan wrote:
Attached is an undocumented patch that allows pg_restore to omit
post-data items or omit all but post-data items. This has been discussed
before, and Simon sent in a patch back on 2008, which has bitrotted
some. I'm not sure why it was dropped at the time, but I think it's time
to do this. This patch relies on some infrastructure that was added
since Simon's patch, so it works a bit differently (and more simply).If it's not clear from Andrew's description, the purpose of this patch
is to allow dividing your pgdump into 3 portions:1. schema
2. data
3. constraints/indexesThis allows users to implement a number of custom solutions for ad-hoc
parallel dump, conditional loading, data munging and sampled databases.
While doing so was possible before using the manifest from pg_restore
-l, the manifest approach has been complex to automate and relies on
obscure knowledge.I have immediate production use for this patch and may be backporting it.
FWIW, I got around this by writing a perl script that calls pg_dump -s and watches for the end of table create statements (IIRC it specifically looks for the first CREATE INDEX). The advantage to that approach is that you don't have to first create a custom format dump and then run pg_restore against that.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On 08/25/2011 06:05 PM, Jim Nasby wrote:
On Aug 24, 2011, at 7:43 PM, Josh Berkus wrote:
On 8/23/11 1:30 PM, Andrew Dunstan wrote:
Attached is an undocumented patch that allows pg_restore to omit
post-data items or omit all but post-data items. This has been discussed
before, and Simon sent in a patch back on 2008, which has bitrotted
some. I'm not sure why it was dropped at the time, but I think it's time
to do this. This patch relies on some infrastructure that was added
since Simon's patch, so it works a bit differently (and more simply).If it's not clear from Andrew's description, the purpose of this patch
is to allow dividing your pgdump into 3 portions:1. schema
2. data
3. constraints/indexesThis allows users to implement a number of custom solutions for ad-hoc
parallel dump, conditional loading, data munging and sampled databases.
While doing so was possible before using the manifest from pg_restore
-l, the manifest approach has been complex to automate and relies on
obscure knowledge.I have immediate production use for this patch and may be backporting it.
FWIW, I got around this by writing a perl script that calls pg_dump -s and watches for the end of table create statements (IIRC it specifically looks for the first CREATE INDEX). The advantage to that approach is that you don't have to first create a custom format dump and then run pg_restore against that.
Well, notwithstanding my well known love of perl, that strikes me as
spending a pound to save a penny. And custom format dumps rock ;-) Also,
your recipe above is buggy, BTW. A CREATE INDEX statement might well not
be the first item in the post-data section.
But we could also add these switches to pg_dump too if people feel it's
worthwhile. I haven't looked but the logic should not be terribly hard.
cheers
andrew
On 08/25/2011 06:15 PM, Andrew Dunstan wrote:
But we could also add these switches to pg_dump too if people feel
it's worthwhile. I haven't looked but the logic should not be terribly
hard.
Something like the attached, in fact, which seems pretty simple.
cheers
andrew
Attachments:
exclude-post-data-steps_pg_dump.patchtext/x-patch; name=exclude-post-data-steps_pg_dump.patchDownload+26-1
On Fri, Aug 26, 2011 at 11:22 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
But we could also add these switches to pg_dump too if people feel it's
worthwhile. I haven't looked but the logic should not be terribly hard.Something like the attached, in fact, which seems pretty simple.
It seems like there are three sets of things you might want here:
pre-data, data, post-data. So in the end we could end up with:
--pre-data-only
--post-data-only
--data-only
--no-pre-data
--no-post-data
--no-data
And then maybe someone will want just the create index commands and
not the constraint commands. It seems like it might be more elegant
to come up with a single switch where you can list which things you
want:
--sections='predata data'
--sections='postdata'
--sections='index'
Just thinking out loud....
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 2011-08-26 at 12:46 -0400, Robert Haas wrote:
--sections='predata data'
--sections='postdata'
--sections='index'
Agreed. After command line options reach a certain level of complexity,
I think it's worth looking for a more general way to express them.
Regards,
Jeff Davis
On 08/26/2011 12:46 PM, Robert Haas wrote:
On Fri, Aug 26, 2011 at 11:22 AM, Andrew Dunstan<andrew@dunslane.net> wrote:
But we could also add these switches to pg_dump too if people feel it's
worthwhile. I haven't looked but the logic should not be terribly hard.Something like the attached, in fact, which seems pretty simple.
It seems like there are three sets of things you might want here:
pre-data, data, post-data. So in the end we could end up with:--pre-data-only
--post-data-only
--data-only
--no-pre-data
--no-post-data
--no-dataAnd then maybe someone will want just the create index commands and
not the constraint commands. It seems like it might be more elegant
to come up with a single switch where you can list which things you
want:--sections='predata data'
--sections='postdata'
--sections='index'Just thinking out loud....
I knew there would be some bike-shedding about how we specify these
things, which is why I haven't written docs yet.
All the possibilities you specify except for the indexes section can be
done by using these switches in combination with -s and -a.
For anything more fine-grained, I'm inclined to say that people need to
roll their own. pg_restore's --list and --use-list give you extremely
fine-grained control. I have working scripts which use these for example
to filter out londiste and pgq objects, certain large tables, audit
objects and more. As an example of the complexity I think we should
avoid, which section would UNIQUE and PRIMARY KEY constraints belong in?
"constraints" because that's what they are, or "indexes" because that's
what they create? No matter which answer you choose someone will claim
you have violated POLA.
Chopping things into pre-data, data and post-data would get us around
99% of the cases we could reasonably provide for in my experience. That
seems enough :-)
I don't have anything in principle against your '--sections="foo bar"'
suggestion, but it would be more work to program. Simpler, and probably
more consistent with how we do other things, would be allowing multiple
--section options, if we don't want to have named options such as I have
provided.
cheers
andrew
On Fri, Aug 26, 2011 at 1:15 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
I don't have anything in principle against your '--sections="foo bar"'
suggestion, but it would be more work to program. Simpler, and probably more
consistent with how we do other things, would be allowing multiple --section
options, if we don't want to have named options such as I have provided.
I wouldn't object to that, but "more work to program" probably means
about an extra 10 lines of code in this particular case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from Robert Haas's message of vie ago 26 15:36:36 -0300 2011:
On Fri, Aug 26, 2011 at 1:15 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
I don't have anything in principle against your '--sections="foo bar"'
suggestion, but it would be more work to program. Simpler, and probably more
consistent with how we do other things, would be allowing multiple --section
options, if we don't want to have named options such as I have provided.I wouldn't object to that, but "more work to program" probably means
about an extra 10 lines of code in this particular case.
The "--section=data --section=indexes" proposal seems very reasonable to
me -- more so than "--sections='data indexes'".
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Aug 26, 2011, at 12:15 PM, Andrew Dunstan wrote:
I knew there would be some bike-shedding about how we specify these things, which is why I haven't written docs yet.
While we're debating what shade of yellow to paint the shed...
My actual use case is to be able to be able to "inject" SQL into a SQL-formatted dump either pre- or post-data (I'm on 8.3, so I don't actually dump any data; I'm *mostly* emulating the ability to dump data on just certain tables).
So for what I'm doing, the ideal interface would be a way to tell pg_dump "When you're done dumping all table structures but before you get to any constraints, please run $COMMAND and inject it's output into the dump output." For some of the data obfuscation we're doing it would be easiest if $COMMAND was a perl script instead of SQL, but we could probably convert it.
Of course, many other folks actually need the ability to just spit out specific portions of the dump; I'm hoping we can come up with something that supports both concepts.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Alvaro Herrera <alvherre@commandprompt.com> writes:
The "--section=data --section=indexes" proposal seems very reasonable to
me -- more so than "--sections='data indexes'".
+1 ... not only easier to code and less squishily defined, but more like
the existing precedent for other pg_dump switches, such as --table.
regards, tom lane
On 08/26/2011 04:46 PM, Jim Nasby wrote:
On Aug 26, 2011, at 12:15 PM, Andrew Dunstan wrote:
I knew there would be some bike-shedding about how we specify these things, which is why I haven't written docs yet.
While we're debating what shade of yellow to paint the shed...
My actual use case is to be able to be able to "inject" SQL into a SQL-formatted dump either pre- or post-data (I'm on 8.3, so I don't actually dump any data; I'm *mostly* emulating the ability to dump data on just certain tables).
So for what I'm doing, the ideal interface would be a way to tell pg_dump "When you're done dumping all table structures but before you get to any constraints, please run $COMMAND and inject it's output into the dump output." For some of the data obfuscation we're doing it would be easiest if $COMMAND was a perl script instead of SQL, but we could probably convert it.
Of course, many other folks actually need the ability to just spit out specific portions of the dump; I'm hoping we can come up with something that supports both concepts.
Well, the Unix approach is to use tools that do one thing well to build
up more complex tools. Making pg_dump run some external command to
inject things into the stream seems like the wrong thing given this
philosophy. Use pg_dump to get the bits you want (pre-data, post-data)
and sandwich them around whatever else you want. As for getting data
from just certain tables, I just posted a patch for pg_dump to exclude
data for certain tables, and we could look at providing a positive as
well as a negative filter if there is sufficient demand.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
For anything more fine-grained, I'm inclined to say that people need to roll
their own. pg_restore's --list and --use-list give you extremely
fine-grained control. I have working scripts which use these for example to
filter out londiste and pgq objects, certain large tables, audit objects and
Which is exactly the core features of pg_staging, that builds schema
whitelist and schema_nodata options on top of pg_restore listing. The
only complex thing here is to be able to filter out triggers using a
function defined in a schema you're filtering out, but pg_staging has
support for that.
http://tapoueh.org/pgsql/pgstaging.html
https://github.com/dimitri/pg_staging
http://tapoueh.org/blog/2011/03/29-towards-pg_staging-10.html
And you can also only use the pg_restore listing commands of pg_staging
without having to do the full installation of its features. Will write
some article about how to use it for only catalog listing purpose,
without its infrastructure for fetching backups and managing dev staging
environments.
I don't have anything in principle against your '--sections="foo bar"'
suggestion, but it would be more work to program. Simpler, and probably more
consistent with how we do other things, would be allowing multiple --section
options, if we don't want to have named options such as I have provided.
+1 for --section foo --section bar.
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160
Well, notwithstanding my well known love of perl, that strikes me as
spending a pound to save a penny. And custom format dumps rock ;-) Also,
your recipe above is buggy, BTW. A CREATE INDEX statement might well not
be the first item in the post-data section.But we could also add these switches to pg_dump too if people feel it's
worthwhile. I haven't looked but the logic should not be terribly hard.
A big +1 to --pre-data and --post-data, but until we get there, or
if you have an existing dump file (schema *or* schema+data) that needs
parsing, there is an existing tool:
http://blog.endpoint.com/2010/01/splitting-postgres-pgdump-into-pre-and.html
Once these new flags and the ability to custom format dump pg_dumpall
is done, I'll have very little left to complain about with pg_dump :)
- --
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201108271855
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----
iEYEAREDAAYFAk5ZdfwACgkQvJuQZxSWSsipDQCgpmNtD/I/2gfAzm2b3jouD8nS
qhgAn33t5VLiF8HeslBwCqyMzQJy6VN5
=PfK7
-----END PGP SIGNATURE-----
On 08/27/2011 06:56 PM, Greg Sabino Mullane wrote:
Once these new flags and the ability to custom format dump pg_dumpall
is done, I'll have very little left to complain about with pg_dump :)
It's off topic. But I think custom format would require a major mangling
to be able to handle a complete cluster. This isn't just a simple matter
of programming, IMNSHO.
cheers
andrew
On Aug 26, 2011, at 5:23 PM, Andrew Dunstan wrote:
On 08/26/2011 04:46 PM, Jim Nasby wrote:
On Aug 26, 2011, at 12:15 PM, Andrew Dunstan wrote:
I knew there would be some bike-shedding about how we specify these things, which is why I haven't written docs yet.
While we're debating what shade of yellow to paint the shed...
My actual use case is to be able to be able to "inject" SQL into a SQL-formatted dump either pre- or post-data (I'm on 8.3, so I don't actually dump any data; I'm *mostly* emulating the ability to dump data on just certain tables).
So for what I'm doing, the ideal interface would be a way to tell pg_dump "When you're done dumping all table structures but before you get to any constraints, please run $COMMAND and inject it's output into the dump output." For some of the data obfuscation we're doing it would be easiest if $COMMAND was a perl script instead of SQL, but we could probably convert it.
Of course, many other folks actually need the ability to just spit out specific portions of the dump; I'm hoping we can come up with something that supports both concepts.
Well, the Unix approach is to use tools that do one thing well to build up more complex tools. Making pg_dump run some external command to inject things into the stream seems like the wrong thing given this philosophy. Use pg_dump to get the bits you want (pre-data, post-data) and sandwich them around whatever else you want.
I agree... except for one little niggling concern: If pg_dump is injecting something, then the DDL is being grabbed with a single, consistent snapshot. --pre and --post do not get you that (though we could probably use the new ability to export snapshots to fix that...)
As for getting data from just certain tables, I just posted a patch for pg_dump to exclude data for certain tables, and we could look at providing a positive as well as a negative filter if there is sufficient demand.
Unfortunately some of the dumped data needs to be sanitized, so that won't work unless I can also dump an arbitrary SELECT. But yes, a positive filter would definitely be welcome.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Excerpts from Jim Nasby's message of mié ago 31 16:45:59 -0300 2011:
On Aug 26, 2011, at 5:23 PM, Andrew Dunstan wrote:
On 08/26/2011 04:46 PM, Jim Nasby wrote:
On Aug 26, 2011, at 12:15 PM, Andrew Dunstan wrote:
I knew there would be some bike-shedding about how we specify these things, which is why I haven't written docs yet.
While we're debating what shade of yellow to paint the shed...
My actual use case is to be able to be able to "inject" SQL into a SQL-formatted dump either pre- or post-data (I'm on 8.3, so I don't actually dump any data; I'm *mostly* emulating the ability to dump data on just certain tables).
So for what I'm doing, the ideal interface would be a way to tell pg_dump "When you're done dumping all table structures but before you get to any constraints, please run $COMMAND and inject it's output into the dump output." For some of the data obfuscation we're doing it would be easiest if $COMMAND was a perl script instead of SQL, but we could probably convert it.
Of course, many other folks actually need the ability to just spit out specific portions of the dump; I'm hoping we can come up with something that supports both concepts.
Well, the Unix approach is to use tools that do one thing well to build up more complex tools. Making pg_dump run some external command to inject things into the stream seems like the wrong thing given this philosophy. Use pg_dump to get the bits you want (pre-data, post-data) and sandwich them around whatever else you want.
I agree... except for one little niggling concern: If pg_dump is injecting something, then the DDL is being grabbed with a single, consistent snapshot. --pre and --post do not get you that (though we could probably use the new ability to export snapshots to fix that...)
Eh, --pre and --post are pg_restore flags, so you already have a
consistent snapshot.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 08/31/2011 04:03 PM, Alvaro Herrera wrote:
Well, the Unix approach is to use tools that do one thing well to build up more complex tools. Making pg_dump run some external command to inject things into the stream seems like the wrong thing given this philosophy. Use pg_dump to get the bits you want (pre-data, post-data) and sandwich them around whatever else you want.
I agree... except for one little niggling concern: If pg_dump is injecting something, then the DDL is being grabbed with a single, consistent snapshot. --pre and --post do not get you that (though we could probably use the new ability to export snapshots to fix that...)
Eh, --pre and --post are pg_restore flags, so you already have a
consistent snapshot.
We've been talking about adding them for pg_dump too.
I take Jim's point about the snapshot, but I still don't feel it's a
good reason to allow some arbitrary code or script to be run between
them (and after all, it's not likely to run with the same snapshot anyway).
cheers
andrew