Add --include-table-data-where option to pg_dump, to export only a subset of table data
Many times I've wanted to export a subset of a database, using some sort of
row filter condition on some of the large tables. E.g. copying a
production database to a staging environment, but with some time series
data only from the past month.
We have the existing options:
--include-table=table (and its -t synonym)
--exclude-table=table
--exclude-table-data=table
I propose a new option:
--include-table-data-where=table:filter_clause
One would use this option as follows:
pg_dump --include-table-data-where=largetable:"created_at >=
'2018-05-01'" database_name
The filter_clause is used as the contents of a WHERE clause when querying
the data to generate the COPY statement produced by pg_dump.
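To illustrate the shape of the query this produces (a sketch only, not the actual pg_dump internals; build_dump_query is a made-up name for illustration), the filter is wrapped in "WHERE (...)" so operator precedence inside the filter cannot leak into the surrounding query:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: build the data-dump query for one table.
 * With no filter, "true" keeps every row, which matches a normal dump. */
char *
build_dump_query(const char *table, const char *filter)
{
    const char *fmt = "COPY (SELECT * FROM %s WHERE (%s)) TO stdout;";
    size_t      len;
    char       *query;

    if (filter == NULL)
        filter = "true";        /* no filter: dump all rows */
    /* measure, then format into an exactly-sized buffer */
    len = (size_t) snprintf(NULL, 0, fmt, table, filter) + 1;
    query = malloc(len);
    snprintf(query, len, fmt, table, filter);
    return query;
}
```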
I've prepared a proposed patch for this, which is attached. The code
changes are rather straightforward. I did have to add the ability to carry
around an extra pointer-sized object to the simple_list implementation, in
order to allow the filter clause to be associated to the matching oids of
the table pattern. It seemed the best way to augment the existing
simple_list implementation, but change as little as possible elsewhere in
the codebase. (Note that SimpleOidList is actually only used by pg_dump).
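The data-structure change described above can be sketched roughly as follows (illustrative only, not the actual fe_utils/simple_list.h declarations; the append function's name is a guess). Each OID cell carries one extra pointer, so a matched table OID can be associated with its filter clause:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned int Oid;

typedef struct SimpleOidListCell
{
    struct SimpleOidListCell *next;
    Oid         val;
    void       *extra_data;     /* e.g. the filter clause for this OID */
} SimpleOidListCell;

typedef struct SimpleOidList
{
    SimpleOidListCell *head;
    SimpleOidListCell *tail;
} SimpleOidList;

void
simple_oid_list_append_data(SimpleOidList *list, Oid val, void *extra_data)
{
    SimpleOidListCell *cell = malloc(sizeof(SimpleOidListCell));

    cell->next = NULL;
    cell->val = val;
    cell->extra_data = extra_data;
    if (list->tail)
        list->tail->next = cell;
    else
        list->head = cell;
    list->tail = cell;
}

/* Like a membership test, but on a hit also hand back the cell's data. */
bool
simple_oid_list_member2(SimpleOidList *list, Oid val, void **extra_data)
{
    SimpleOidListCell *cell;

    for (cell = list->head; cell; cell = cell->next)
    {
        if (cell->val == val)
        {
            if (extra_data)
                *extra_data = cell->extra_data;
            return true;
        }
    }
    return false;
}
```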
Feel free to review and propose any amendments.
Attachment: pgdump-include-table-data-where-v1.patch
Greetings,
* Carter Thaxton (carter.thaxton@gmail.com) wrote:
Many times I've wanted to export a subset of a database, using some sort of
row filter condition on some of the large tables. E.g. copying a
production database to a staging environment, but with some time series
data only from the past month.
We have the existing options:
--include-table=table (and its -t synonym)
--exclude-table=table
--exclude-table-data=table
I propose a new option:
--include-table-data-where=table:filter_clause
One would use this option as follows:
pg_dump --include-table-data-where=largetable:"created_at >=
'2018-05-01'" database_name
The filter_clause is used as the contents of a WHERE clause when querying
the data to generate the COPY statement produced by pg_dump.
I've wanted something similar to this in the past as well, and, as
you've seen, we have some support for this kind of thing in pg_dump
already and what you're doing is exposing that.
I've prepared a proposed patch for this, which is attached. The code
changes are rather straightforward. I did have to add the ability to carry
around an extra pointer-sized object to the simple_list implementation, in
order to allow the filter clause to be associated to the matching oids of
the table pattern. It seemed the best way to augment the existing
simple_list implementation, but change as little as possible elsewhere in
the codebase. (Note that SimpleOidList is actually only used by pg_dump).
Feel free to review and propose any amendments.
I've only taken a quick look but I don't see any regression tests, for
starters, and it's not clear if this can be passed multiple times for
one pg_dump run (I'd certainly hope that it could be...).
Also, if you haven't already, this should be registered on the
commitfest app, so we don't lose track of it.
Thanks!
Stephen
Hello,
I've only taken a quick look but I don't see any regression tests, for
starters, and it's not clear if this can be passed multiple times for
one pg_dump run (I'd certainly hope that it could be...).
Yes, this will absolutely accept multiple options for one run, which is how
I'd imagine it would typically be used.
In fact, for each table_pattern:filter_clause you provide as an option, it
will apply a corresponding WHERE clause for *every* table that matches the
table_pattern.
So if you happened to use a wildcard in the table_pattern, you could
actually end up with multiple tables filtered by the same WHERE clause.
For example:
pg_dump --include-table-data-where="table_*:created_at >= '2018-05-01'"
--include-table-data-where="other_table:id < 100" db_name
This will filter every table matching "table_*", e.g. ["table_0", "table_1",
"table_2", "table_associated"], each with "WHERE created_at >=
'2018-05-01'", and it will also filter "other_table" with "WHERE id < 100".
Not sure how useful the wildcard feature is, but it matches the behavior
of the other pg_dump options that specify tables, and came along for free
by reusing that implementation.
Also, if you haven't already, this should be registered on the
commitfest app, so we don't lose track of it.
On Tue, May 22, 2018 at 4:05 AM, Stephen Frost <sfrost@snowman.net> wrote:
* Carter Thaxton (carter.thaxton@gmail.com) wrote:
pg_dump --include-table-data-where=largetable:"created_at >=
'2018-05-01'" database_name
I've wanted something similar to this in the past as well, and, as
you've seen, we have some support for this kind of thing in pg_dump
already and what you're doing is exposing that.
+1
I've prepared a proposed patch for this, which is attached. The code
changes are rather straightforward. I did have to add the ability to carry
around an extra pointer-sized object to the simple_list implementation, in
order to allow the filter clause to be associated to the matching oids of
the table pattern. It seemed the best way to augment the existing
simple_list implementation, but change as little as possible elsewhere in
the codebase. (Note that SimpleOidList is actually only used by pg_dump).
Feel free to review and propose any amendments.
I've only taken a quick look but I don't see any regression tests, for
starters, and it's not clear if this can be passed multiple times for
one pg_dump run (I'd certainly hope that it could be...).
Also, if you haven't already, this should be registered on the
commitfest app, so we don't lose track of it.
Thanks for doing that. Unfortunately the patch seems to be corrupted
in some way, maybe ANSI control characters or something... perhaps you
set color.ui = always in your git config, instead of auto? You might
also consider using git format-patch so you can include a brief commit
message that explains the feature.
--
Thomas Munro
http://www.enterprisedb.com
2018-05-20 20:48 GMT-03:00 Carter Thaxton <carter.thaxton@gmail.com>:
Many times I've wanted to export a subset of a database, using some sort of
row filter condition on some of the large tables. E.g. copying a production
database to a staging environment, but with some time series data only from
the past month.
How would you handle foreign keys? It seems this makes it easier to produce
a dump that won't restore.
We have the existing options:
--include-table=table (and its -t synonym)
--exclude-table=table
--exclude-table-data=table
I propose a new option:
--include-table-data-where=table:filter_clause
I remembered an old thread [1]. At that time pg_dump was not so
decoupled from the backend. We are far from being decoupled in a way
that someone can write his own pg_dump using only calls from a
library. I'm not sure pg_dump is the right place to add another ETL
parameter. We already have too many parameters that could break a
restore (flexibility is always welcome, but too much is not so good).
One would use this option as follows:
pg_dump --include-table-data-where=largetable:"created_at >= '2018-05-01'"
database_name
How would you check that the expression is correct? Every parameter
could quote its value. It means that your parameter has to escape the
quotes in '2018-05-01'. Another problem is that your spec does not show
us how you would handle tables like Foo.Bar or "foo:bar" (does the colon
have to be escaped)?
The filter_clause is used as the contents of a WHERE clause when querying
the data to generate the COPY statement produced by pg_dump.
You are forgetting about the --inserts parameter. Could I use
--include-table-data-where and --inserts?
[1]: /messages/by-id/1212299813.17810.17.camel@ubuntu
--
Euler Taveira
Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
How would you handle foreign keys? It seems easier to produce a dump
that won't restore.
This proposal will not attempt to be smart about foreign keys or anything
like that. I don't believe that would even be expected.
We have the existing options:
--include-table=table (and its -t synonym)
--exclude-table=table
--exclude-table-data=table
I propose a new option:
--include-table-data-where=table:filter_clause
I remembered an old thread [1]. At that time pg_dump was not so
decoupled from the backend. We are far from being decoupled in a way
that someone can write his own pg_dump using only calls from a
library. I'm not sure pg_dump is the right place to add another ETL
parameter. We already have too much parameters that could break a
restore (flexibility is always welcome but too much is not so good).
In general, I agree with your sentiment that we don't want too much
flexibility in this tool. However, this just seems like a very obvious
missing feature to me. I was frankly surprised that pg_dump didn't already
have it.
I've designed this feature so that it behaves like a more flexible middle
ground between --exclude-table-data and --include-table. Instead of dumping the
schema and zero rows, or the schema and all of the rows, it dumps the
schema and some specific rows.
Providing "--include-table-data-where=table:false" behaves exactly like
--exclude-table-data, and "--include-table-data-where=table:true" behaves
exactly like --include-table.
It does no more or less to prevent a restore. Given that
--exclude-table-data already exists, this seems to introduce no new issues
with restore.
pg_dump --include-table-data-where=largetable:"created_at >=
'2018-05-01'"
database_name
How would you check that that expression is correct?
The patch as already provided produces an error message and appropriate
exit code during the dump process, presenting the invalid SQL that is
produced as part of the WHERE clause.
I could see some value in refactoring it to provide error messages earlier
in the process, but it's actually not bad as is.
Every parameter could quote its value. It means that your parameter has to
escape the quotes in '2018-05-01'.
I don't understand. The double quotes in my example are bash shell
quotes. There is no special quote parsing in this patch. The single
quotes are part of the WHERE clause.
Note that pg_dump already uses getopt_long, so it's not required to use the
= symbol to separate the option from its associated value. So, it would also
be fine to call as follows:
pg_dump --include-table-data-where "largetable:created_at >=
'2018-05-01'" database_name
Another problem is that your spec does not show
us how you would handle tables like Foo.Bar or "foo:bar" (colon have
to be escaped)?
Using a dot to separate the schema works just fine. My proposal uses the
same mechanism as --include-table, --exclude-table, and
--exclude-table-data. In fact, it even supports wildcards in those
patterns.
Your point about a colon in the table name is interesting. In all my years
of working with PostgreSQL and other databases, I've never encountered a
table name that contained a colon. Perhaps an escape character, like \:
could work. Is there another separator character you would suggest, which
is illegal in table names, but also intuitive as a separator? Maybe a
comma?
The filter_clause is used as the contents of a WHERE clause when querying
the data to generate the COPY statement produced by pg_dump.
You are forgetting about the --inserts parameter. Could I use
--include-table-data-where and --inserts?
Yes, the --inserts parameter works just fine. Perhaps I should have said
"the COPY statement or INSERT statements".
Ah yes, thanks. I did in fact have colors enabled.
I've attached a new patch generated by `git format-patch`. Hopefully
that's correct.
On Mon, May 21, 2018 at 4:00 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Attachment: pgdump-include-table-data-where-v2.patch (+112 -10)
On Wed, May 23, 2018 at 5:18 PM, Carter Thaxton
<carter.thaxton@gmail.com> wrote:
Ah yes, thanks. I did in fact have colors enabled.
I've attached a new patch generated by `git format-patch`. Hopefully that's
correct.
pg_dump.c:2323:2: warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
char *filter_clause = NULL;
^
You need to declare this variable at the top of its scope. If you're
using GCC or Clang you might consider building with COPT=-Werror so
that any compiler warnings will stop the build from succeeding.
This doesn't build on Windows [1], probably for the same reason.
/*
* Is OID present in the list?
+ * Also return extra pointer-sized data by setting extra_data paramter
*/
bool
-simple_oid_list_member(SimpleOidList *list, Oid val)
+simple_oid_list_member2(SimpleOidList *list, Oid val, void **extra_data)
I feel like that isn't in the spirit of Lisp "member". It's now a
kind of association list. I wonder if we are really constrained to
use the cave-man facilities in fe_utils anyway. Though I suppose this
list is never going to be super large so maybe the data structure
doesn't matter too much (famous last words).
+ char *where_clause = pg_malloc(strlen(filter_clause) + 8 + 1);
+ strcpy(where_clause, "WHERE (");
+ strcat(where_clause, filter_clause);
+ strcat(where_clause, ")");
pg_dump.c seems to be allowed to use psprintf() which'd be less
fragile than the above code.
+ /* When match_data is set, split the pattern on the ':' chararcter,
typo
+ * Also return extra pointer-sized data by setting extra_data paramter
typo
[1]: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.311
--
Thomas Munro
http://www.enterprisedb.com
pg_dump.c:2323:2: warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
char *filter_clause = NULL;
^
You need to declare this variable at the top of its scope. If you're
using GCC or Clang you might consider building with COPT=-Werror so
that any compiler warnings will stop the build from succeeding.
This doesn't build on Windows [1], probably for the same reason.
Done. And thanks for the tip about COPT=-Werror
/*
 * Is OID present in the list?
+ * Also return extra pointer-sized data by setting extra_data paramter
 */
bool
-simple_oid_list_member(SimpleOidList *list, Oid val)
+simple_oid_list_member2(SimpleOidList *list, Oid val, void **extra_data)
I feel like that isn't in the spirit of Lisp "member". It's now a
kind of association list. I wonder if we are really constrained to
use the cave-man facilities in fe_utils anyway. Though I suppose this
list is never going to be super large so maybe the data structure
doesn't matter too much (famous last words).
Yeah, I'm just trying to fit into the surrounding code as much as
possible. If you have a specific recommendation, I'm all ears.
SimpleOidList is only used by pg_dump, so if we want to rename or refactor
this data structure, it won't have much widespread impact.
And you're right that the list is not going to be particularly large.
Consider that it's already a simple linked-list, and not some more complex
hashtable, for the use cases that it already covers in pg_dump. For all of
these uses, it will only be as large as the number of options provided on
the command-line.
+ char *where_clause = pg_malloc(strlen(filter_clause) + 8 + 1);
+ strcpy(where_clause, "WHERE (");
+ strcat(where_clause, filter_clause);
+ strcat(where_clause, ")");
pg_dump.c seems to be allowed to use psprintf() which'd be less
fragile than the above code.
Done. Didn't realize psprintf() was available here.
typo
And fixed typos.
Thanks for the review!
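As an aside for readers following along: psprintf() formats into a freshly allocated, exactly-sized buffer. Its behavior can be sketched portably like this (xsprintf is a made-up name; PostgreSQL's real psprintf additionally exits on out-of-memory):

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Portable stand-in for pg's psprintf(): measure, allocate, format. */
static char *
xsprintf(const char *fmt, ...)
{
    va_list ap;
    int     needed;
    char   *buf;

    va_start(ap, fmt);
    needed = vsnprintf(NULL, 0, fmt, ap);   /* measure the result */
    va_end(ap);

    buf = malloc(needed + 1);
    va_start(ap, fmt);
    vsnprintf(buf, needed + 1, fmt, ap);    /* format for real */
    va_end(ap);
    return buf;
}

/* The strcpy/strcat sequence quoted above then collapses to one call. */
char *
make_where_clause(const char *filter_clause)
{
    return xsprintf("WHERE (%s)", filter_clause);
}
```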
Attachment: pgdump-include-table-data-where-v3.patch (+108 -10)
After some consideration, I've created a new patch that addresses even more
of the various concerns.
Most notably, the command-line option is shortened to simply --where, which
is much easier to read and understand, and matches the earlier proposal [1].
bool
-simple_oid_list_member(SimpleOidList *list, Oid val)
+simple_oid_list_member2(SimpleOidList *list, Oid val, void **extra_data)
I feel like that isn't in the spirit of Lisp "member". It's now a
kind of association list.
My new patch has this function named simple_oid_list_find_data, to indicate
that it's working with some extra data, and the corresponding append is
called simple_oid_list_append_data.
Another problem is that your spec does not show
us how you would handle tables like Foo.Bar or "foo:bar" (colon have
to be escaped)?
This is now addressed in the v4 patch. I've added a new function to
string_utils, called findUnquotedChar, which is used to find the colon
character in the command-line argument, which separates the table name
from the filter clause.
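The scan findUnquotedChar performs can be sketched roughly as follows (a standalone reimplementation for illustration, not the patch's actual code). A doubled "" inside a quoted identifier simply toggles the in-quotes state twice, leaving it set, so escaped quotes need no special case for this purpose:

```c
#include <assert.h>
#include <stdbool.h>

/* Return the index of the first occurrence of ch that lies outside any
 * double-quoted identifier, or -1 if there is none. */
int
find_unquoted_char(const char *s, char ch)
{
    bool in_quotes = false;
    int  i;

    for (i = 0; s[i] != '\0'; i++)
    {
        if (s[i] == '"')
            in_quotes = !in_quotes;     /* "" toggles twice: still quoted */
        else if (s[i] == ch && !in_quotes)
            return i;
    }
    return -1;
}
```

The option parser can then split the table pattern from the filter clause at the first unquoted colon.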
For example, if you have a table called "foo:bar", then you would use the
--where option as follows:
pg_dump --where "\"foo:bar\":created_at >= '2018-05-01'" dbname
If you have a table with a double-quote in the name (gah!), as well as
colons, like: has"quote:and:colons
then this still works, because such quotes in the name are escaped by
doubling the quote char, e.g. "has""quote:and:colons", and also works with
this patch:
pg_dump --where "\"has\"\"quote:and:colons\":created_at >= '2018-05-01'" dbname
[1]: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.311
Attachment: pgdump-include-table-data-where-v4.patch (+138 -10)
hey,
i am reviewing this patch
On Thu, May 31, 2018 at 4:49 AM, Carter Thaxton <carter.thaxton@gmail.com>
wrote:
pg_dump --where "\"foo:bar\":created_at >= '2018-05-01'" dbname
It would be more SQL-ish if it were specified like:
--table=foo --where="bar created_at >= 2018-05-01"
I don't like the idea of duplicating the existing --table behavior; it may
confuse users. I'd rather recommend extending it. Also, when I test it with
the --table option, the content of the dump file depends on the option
specified first.
Regards
Surafel
On Fri, Jun 29, 2018 at 8:09 AM, Surafel Temesgen <surafel3000@gmail.com> wrote:
hey,
i am reviewing this patch
On Thu, May 31, 2018 at 4:49 AM, Carter Thaxton <carter.thaxton@gmail.com>
wrote:
pg_dump --where "\"foo:bar\":created_at >= '2018-05-01'" dbname
It would be more SQL-ish if it were specified like:
--table=foo --where="bar created_at >= 2018-05-01"
I don't like the idea of duplicating the existing --table behavior; it may
confuse users. I'd rather recommend extending it. Also, when I test it with
the --table option, the content of the dump file depends on the option
specified first.
But you can specify multiple tables. You wouldn't want the same WHERE
clause to apply to all of them.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
The whole reason for the colon in the --where option is to indicate which
table the WHERE clause should refer to, so that one can dump less than all
of the rows.
The --table option is totally different. It specifies which tables to dump
at all.
If I provide a --where option, and no --table option, I want the WHERE
clause to apply to the given table, and otherwise dump all tables.
If one supplies a --table option, it won't dump all tables - it will only
dump the one specified. I don't want to have to specify all the tables
with --table, just to use the --where option.
Also, there may be some misunderstanding about "foo:bar" above. That's an
example of using a namespaced table, where "bar" is a table in the
namespace "foo". Normally, assuming your table is named "bar" in the
default namespace, you would just say something like:
pg_dump --where "bar:created_at >= '2018-05-01'"
On Mon, Jul 2, 2018 at 11:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jul 02, 2018 at 03:11:46PM -0700, Carter Thaxton wrote:
Also, there may be some misunderstanding about "foo:bar" above. That's an
example of using a namespaced table, where "bar" is a table in the
namespace "foo". Normally, assuming your table is named "bar" in the
default namespace, you would just say something like:
pg_dump --where "bar:created_at >= '2018-05-01'"
I am wondering how this works at parsing if the table name, or one of
the columns includes a colon character :)
--
Michael
pg_dump --where "bar:created_at >= '2018-05-01'"
I am wondering how this works at parsing if the table name, or one of
the columns includes a colon character :)
The proposed patch will handle quoted identifiers. E.g. the following will
work just fine:
pg_dump --where 'table:"column:with:colons" = 5'
Note the use of single quotes in the shell, and then double quotes in the
WHERE clause. There are also many other options for quoting in the shell,
of course.
Please don't top-post on the PostgreSQL lists. See
<http://idallen.com/topposting.html>
Sorry. Thanks for the reminder.
On Mon, Jul 2, 2018 at 9:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
But you can specify multiple tables. You wouldn't want the same WHERE
clause to apply to all of them.
Also, with this new --where option you can specify multiple tables using a
wildcard, and it tries to apply the same WHERE clause to each table. Maybe
that's a desirable feature, because such tables can be structurally similar
too.
regards
Surafel
On Tue, Jul 3, 2018 at 6:31 AM, Surafel Temesgen <surafel3000@gmail.com> wrote:
On Mon, Jul 2, 2018 at 9:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
But you can specify multiple tables. You wouldn't want the same WHERE
clause to apply to all of them.
Also, with this new --where option you can specify multiple tables using a
wildcard, and it tries to apply the same WHERE clause to each table. Maybe
that's a desirable feature, because such tables can be structurally similar
too.
I don't think that's likely to be very useful. I think Carter Thaxton
has the right idea, although using foo:bar to mean foo.bar doesn't
seem like a great plan.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 3, 2018 at 1:11 AM, Carter Thaxton <carter.thaxton@gmail.com>
wrote:
The whole reason for the colon in the --where option is to indicate which
table the WHERE clause should refer to, so that one can dump less than all
of the rows.
The --table option is totally different. It specifies which tables to
dump at all.
Thank you for explaining.
I just have one comment: I found the error message generated on an
incorrect WHERE clause specification strange for pg_dump. I think a query
result status check is needed to handle it and generate a friendlier error
message.
regards
Surafel
On Mon, May 21, 2018 at 6:34 AM Carter Thaxton <carter.thaxton@gmail.com>
wrote:
Many times I've wanted to export a subset of a database, using some sort
of row filter condition on some of the large tables. E.g. copying a
production database to a staging environment, but with some time series
data only from the past month.
We have the existing options:
--include-table=table (and its -t synonym)
--exclude-table=table
--exclude-table-data=table
I propose a new option:
--include-table-data-where=table:filter_clause
One would use this option as follows:
pg_dump --include-table-data-where=largetable:"created_at >=
'2018-05-01'" database_name
The filter_clause is used as the contents of a WHERE clause when querying
the data to generate the COPY statement produced by pg_dump.
I've prepared a proposed patch for this, which is attached. The code
changes are rather straightforward. I did have to add the ability to carry
around an extra pointer-sized object to the simple_list implementation, in
order to allow the filter clause to be associated to the matching oids of
the table pattern. It seemed the best way to augment the existing
simple_list implementation, but change as little as possible elsewhere in
the codebase. (Note that SimpleOidList is actually only used by pg_dump).
Feel free to review and propose any amendments.
Why not simply use \copy (select * from largetable where created_at >=
'2018-05-01') to stdout? That is what I’ve always done when I need
something like this and have not found it particularly bothersome but
rather quite powerful. And here you have tons of flexibility because you
can do joins and whatever else.
FWIW. Thanks,
Jeremy
On Thu, Sep 6, 2018 at 8:40 AM, Jeremy Finzel <finzelj@gmail.com> wrote:
Why not simply use \copy (select * from largetable where created_at >=
'2018-05-01') to stdout? That is what I’ve always done when I need
something like this and have not found it particularly bothersome but
rather quite powerful. And here you have tons of flexibility because you
can do joins and whatever else.
Just skimming the thread but I'd have to say being able to leverage
pg_dump's dependency resolution is a major reason for adding features to it
instead of sticking to writing psql scripts. This feature in a multi-tenant
situation is something with, I suspect, reasonably wide appeal.
David J.