pg_dump --split patch

Started by Joel Jacobsonover 15 years ago56 messageshackers
Jump to latest
#1Joel Jacobson
joel@gluefinance.com

Dear fellow hackers,

Problem: A normal diff of two slightly different schema dump files (pg_dump
-s), will not produce a user-friendly diff, as you get all changes in the
same file.

Solution: I propose a new option to pg_dump, --split, which dumps each
object to a separate file in a user friendly directory structure:

[-f filename] : main dump file, imports each splitted part using \i
[-f filename]-split/[desc]/[tag]/[oid].sql : dump of the oid

Example: If the filename (-f) is "pg.dump", the following directory
structure would be created:

$ ./pg_dump -f /crypt/pg.dump --split -F p -s glue
/crypt/pg.dump-split/VIEW/
/crypt/pg.dump-split/TYPE/
/crypt/pg.dump-split/TRIGGER/
/crypt/pg.dump-split/TABLE/
/crypt/pg.dump-split/SEQUENCE/
/crypt/pg.dump-split/SCHEMA/
/crypt/pg.dump-split/PROCEDURAL_LANGUAGE/
/crypt/pg.dump-split/INDEX/
/crypt/pg.dump-split/FUNCTION/
/crypt/pg.dump-split/FK_CONSTRAINT/
/crypt/pg.dump-split/CONSTRAINT/
/crypt/pg.dump-split/AGGREGATE/

In each such directory, one directory per object name is created.
If we would have a function "foobar" with oid "12345" it would be saved to:
/crypt/pg.dump-split/FUNCTION/foobar/12345.sql

In the "pg.dump" plain text file, the files are "linked in" using the "\i"
psql command, e.g.:
\i /crypt/pg.dump-split/FUNCTION/foobar/12345.sql

Potential use-case scenarios:

*) Version control your database schema, by exporting it daily (using
--split) and commiting the differences.

*) Compare differences of schema dumps created in different points in time.
Since objects are stored in separate files, it is easier to see what areas
were modified, compared to looking at the diff of two entire schemas.

*) Restore only some objects, based on type (e.g., only the functions) or
name (e.g. only fucntions of certain name/names).

I've tested the patch for both the latest HEAD (9.1devel) as well as 8.4.6.

Feedback welcome.

--
Best regards,

Joel Jacobson
Glue Finance

Attachments:

pg-dump-split-plain-text-files-8.4.6.patchapplication/octet-stream; name=pg-dump-split-plain-text-files-8.4.6.patchDownload+86-0
pg-dump-split-plain-text-files-9.1devel.patchapplication/octet-stream; name=pg-dump-split-plain-text-files-9.1devel.patchDownload+86-0
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#1)
Re: pg_dump --split patch

Joel Jacobson <joel@gluefinance.com> writes:

Dear fellow hackers,
Problem: A normal diff of two slightly different schema dump files (pg_dump
-s), will not produce a user-friendly diff, as you get all changes in the
same file.

Solution: I propose a new option to pg_dump, --split, which dumps each
object to a separate file in a user friendly directory structure:

Um ... how does that solve the claimed problem exactly?

[-f filename] : main dump file, imports each splitted part using \i
[-f filename]-split/[desc]/[tag]/[oid].sql : dump of the oid

This particular choice seems remarkably *un* friendly, since two dumps
from different DBs will inevitably not share the same OIDs, making it
practically impossible to compare them even if they are logically
identical. But even without the choice to use OIDs in the filenames
I'm unconvinced that file-per-object is a good idea in any way shape or
form.

regards, tom lane

#3Joel Jacobson
joel@gluefinance.com
In reply to: Tom Lane (#2)
Re: pg_dump --split patch

2010/12/28 Tom Lane <tgl@sss.pgh.pa.us>

Joel Jacobson <joel@gluefinance.com> writes:

Dear fellow hackers,
Problem: A normal diff of two slightly different schema dump files

(pg_dump

-s), will not produce a user-friendly diff, as you get all changes in the
same file.

Solution: I propose a new option to pg_dump, --split, which dumps each
object to a separate file in a user friendly directory structure:

Um ... how does that solve the claimed problem exactly?

Because then you can do,
$ diff -r <old schema dump dir> <new schema dump dir>,
instead of,
$ diff <old entire schema dump> <new entire schema dump>
which will nicely reveal each individual object modified, as opposed to a
huge global diff of everything

[-f filename] : main dump file, imports each splitted part using \i
[-f filename]-split/[desc]/[tag]/[oid].sql : dump of the oid

This particular choice seems remarkably *un* friendly, since two dumps
from different DBs will inevitably not share the same OIDs, making it
practically impossible to compare them even if they are logically
identical. But even without the choice to use OIDs in the filenames
I'm unconvinced that file-per-object is a good idea in any way shape or
form.

Good point!

To compare two different database, perhaps it's possible to use a sequence,
1,2,...,n for each file in each directory, i.e., /[desc]/[tag]/[n], and to
sort them by something distinct which will ensure the same numbering between
different databases, such as the arguments for functions, or other
properties for other kind of objects. Any ideas?

(In my case, I didn't need to compare schemas between different database. I
needed to compare two dumps created at different points in time of the same
database, which do share the same oids for objects existing in both
versions.)

regards, tom lane

--
Best regards,

Joel Jacobson
Glue Finance

E: jj@gluefinance.com
T: +46 70 360 38 01

Postal address:
Glue Finance AB
Box 549
114 11 Stockholm
Sweden

Visiting address:
Glue Finance AB
Birger Jarlsgatan 14
114 34 Stockholm
Sweden

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#3)
Re: pg_dump --split patch

Joel Jacobson <joel@gluefinance.com> writes:

2010/12/28 Tom Lane <tgl@sss.pgh.pa.us>

Joel Jacobson <joel@gluefinance.com> writes:

Solution: I propose a new option to pg_dump, --split, which dumps each
object to a separate file in a user friendly directory structure:

Um ... how does that solve the claimed problem exactly?

Because then you can do,
$ diff -r <old schema dump dir> <new schema dump dir>,
instead of,
$ diff <old entire schema dump> <new entire schema dump>
which will nicely reveal each individual object modified, as opposed to a
huge global diff of everything

That has at least as many failure modes as the other representation.

regards, tom lane

#5Joel Jacobson
joel@gluefinance.com
In reply to: Tom Lane (#4)
Re: pg_dump --split patch

2010/12/28 Tom Lane <tgl@sss.pgh.pa.us>

That has at least as many failure modes as the other representation.

I don't follow, what do you mean with "failure modes"? The oid in the
filename? I suggested to use a sequence instead but you didn't comment on
that. Are there any other failure modes which could cause a diff -r between
two different databases to break?

(This might be a bad idea for some other reason, but I noticed a few other
users requesting the same feature when I googled "pg_dump split".)

--
Best regards,

Joel Jacobson
Glue Finance

#6Aidan Van Dyk
aidan@highrise.ca
In reply to: Joel Jacobson (#5)
Re: pg_dump --split patch

On Tue, Dec 28, 2010 at 11:59 AM, Joel Jacobson <joel@gluefinance.com> wrote:

I don't follow, what do you mean with "failure modes"? The oid in the
filename? I suggested to use a sequence instead but you didn't comment on
that. Are there any other failure modes which could cause a diff -r between
two different databases to break?

Both OID and sequence mean that your likely to get a diff which is
nothing more than complete files removed from 1 side and added to the
othe rside with different names (i.e. oid's don't match, or an
added/removed object changes all following sequence assingments).

If you're going to try and split, I really think the only usefull
filename has to be similar to something like:
<schema>/<type>/<name>/<part>

If you want to use "diff", you pretty much have to make sure that the
*path* will be identical for similary named objects, irrespective of
anything else in the database. And path has to be encoding aware.

And you want names that glob well, so for instance, you could exclude
*.data (or a schema) from the diff.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#5)
Re: pg_dump --split patch

On 12/28/2010 11:59 AM, Joel Jacobson wrote:

2010/12/28 Tom Lane <tgl@sss.pgh.pa.us <mailto:tgl@sss.pgh.pa.us>>

That has at least as many failure modes as the other representation.

I don't follow, what do you mean with "failure modes"? The oid in the
filename? I suggested to use a sequence instead but you didn't comment
on that. Are there any other failure modes which could cause a diff -r
between two different databases to break?

(This might be a bad idea for some other reason, but I noticed a few
other users requesting the same feature when I googled "pg_dump split".)

A better approach to the problem might be to have a tool which did a
comparison of structures rather than a textual comparison of dumps. For
extra credit, such a tool might even try to produce a sync script for
you ...

Of course, that task might involve more effort than you want to devote
to it.

cheers

andrew

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#5)
Re: pg_dump --split patch

Joel Jacobson <joel@gluefinance.com> writes:

2010/12/28 Tom Lane <tgl@sss.pgh.pa.us>

That has at least as many failure modes as the other representation.

I don't follow, what do you mean with "failure modes"? The oid in the
filename? I suggested to use a sequence instead but you didn't comment on
that. Are there any other failure modes which could cause a diff -r between
two different databases to break?

AFAIK the primary failure modes for diff'ing text dumps are

(1) randomly different ordering of objects from one dump to another.
Your initial proposal would avoid that problem as long as the object
OIDs didn't change, but since it falls down completely across a dump and
reload, or delete and recreate, I can't really see that it's a step
forward. Using a sequence number generated by pg_dump doesn't change
this at all --- the sequence would be just as unpredictable.

(2) randomly different ordering of rows within a table. Your patch
didn't address that, unless I misunderstood quite a bit.

I think the correct fix for (1) is to improve pg_dump's method for
sorting objects. It's not that bad now, but it does have issues with
random ordering of similarly-named objects. IIRC Peter Eisentraut
proposed something for this last winter but it seemed a mite too ugly,
and he got beaten down to just this:

commit 1acc06a1f4ae752793d2199d8d462a6708c8acc2
Author: Peter Eisentraut <peter_e@gmx.net>
Date: Mon Feb 15 19:59:47 2010 +0000

When sorting functions in pg_dump, break ties (same name) by number of argum
ents

Maybe you can do better, but I'd suggest going back to reread the
discussion that preceded that patch.

(This might be a bad idea for some other reason, but I noticed a few other
users requesting the same feature when I googled "pg_dump split".)

AFAIR what those folk really wanted was a selective dump with more
selectivity knobs than exist now. I don't think their lives would be
improved by having to root through a twisty little maze of numbered
files to find the object they wanted.

regards, tom lane

#9Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Joel Jacobson (#1)
Re: pg_dump --split patch

On Tue, Dec 28, 2010 at 11:00 AM, Joel Jacobson <joel@gluefinance.com>wrote:

Dear fellow hackers,

Problem: A normal diff of two slightly different schema dump files (pg_dump
-s), will not produce a user-friendly diff, as you get all changes in the
same file.

Solution: I propose a new option to pg_dump, --split, which dumps each
object to a separate file in a user friendly directory structure:

[-f filename] : main dump file, imports each splitted part using \i
[-f filename]-split/[desc]/[tag]/[oid].sql : dump of the oid

Example: If the filename (-f) is "pg.dump", the following directory
structure would be created:

$ ./pg_dump -f /crypt/pg.dump --split -F p -s glue
/crypt/pg.dump-split/VIEW/
/crypt/pg.dump-split/TYPE/
/crypt/pg.dump-split/TRIGGER/
/crypt/pg.dump-split/TABLE/
/crypt/pg.dump-split/SEQUENCE/
/crypt/pg.dump-split/SCHEMA/
/crypt/pg.dump-split/PROCEDURAL_LANGUAGE/
/crypt/pg.dump-split/INDEX/
/crypt/pg.dump-split/FUNCTION/
/crypt/pg.dump-split/FK_CONSTRAINT/
/crypt/pg.dump-split/CONSTRAINT/
/crypt/pg.dump-split/AGGREGATE/

In each such directory, one directory per object name is created.
If we would have a function "foobar" with oid "12345" it would be saved to:
/crypt/pg.dump-split/FUNCTION/foobar/12345.sql

In the "pg.dump" plain text file, the files are "linked in" using the "\i"
psql command, e.g.:
\i /crypt/pg.dump-split/FUNCTION/foobar/12345.sql

Potential use-case scenarios:

*) Version control your database schema, by exporting it daily (using
--split) and commiting the differences.

*) Compare differences of schema dumps created in different points in time.
Since objects are stored in separate files, it is easier to see what areas
were modified, compared to looking at the diff of two entire schemas.

*) Restore only some objects, based on type (e.g., only the functions) or
name (e.g. only fucntions of certain name/names).

I've tested the patch for both the latest HEAD (9.1devel) as well as 8.4.6.

I would suggest the directory structure as:

/crypt/pg.dump-split/schema-name-1/VIEWS/view-name-1.sql
/crypt/pg.dump-split/schema-name-1/TABLES/table-name-1.sql
...
/crypt/pg.dump-split/schema-name-2/VIEWS/view-name-1.sql
/crypt/pg.dump-split/schema-name-2/TABLES/table-name-1.sql

This might n be more amenable to diff'ing the different dumps. Schemas are
logical grouping of other objects and hence making that apparent in your
dump's hierarchy makes more sense.

Most importantly, as Tom suggested, don't use or rely on OIDs. I think
function overloading is the only case where you can have more than one
object with the same name under a schema. That can be resolved if you
included function signature in filename:

/crypt/pg.dump-split/emp/FUNCTIONS/myfunc-int-char.sql
/crypt/pg.dump-split/emp/FUNCTIONS/myfunc-int-int.sql
/crypt/pg.dump-split/emp/FUNCTIONS/myfunc-int.sql

Regards,
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device

#10Joel Jacobson
joel@gluefinance.com
In reply to: Gurjeet Singh (#9)
Re: pg_dump --split patch

2010/12/28 Gurjeet Singh <singh.gurjeet@gmail.com>

I would suggest the directory structure as:

/crypt/pg.dump-split/schema-name-1/VIEWS/view-name-1.sql
/crypt/pg.dump-split/schema-name-1/TABLES/table-name-1.sql
...
/crypt/pg.dump-split/schema-name-2/VIEWS/view-name-1.sql
/crypt/pg.dump-split/schema-name-2/TABLES/table-name-1.sql

This might n be more amenable to diff'ing the different dumps. Schemas are
logical grouping of other objects and hence making that apparent in your
dump's hierarchy makes more sense.

Thanks Gurjeet and Tom for good feedback!

I've made some changes and attached new patches.
Looks much better now I think!

This is what I've changed,

*) Not using oid anymore in the filename
*) New filename/path structure: [-f
filename]-split/[schema]/[desc]/[tag].sql
*) If two objects share the same name tag for the same [schema]/[desc], -2,
-3, etc is appended to the name. Example:
~/pg.dump-split/public/FUNCTION/foobar.sql
~/pg.dump-split/public/FUNCTION/foobar-2.sql
~/pg.dump-split/public/FUNCTION/barfoo.sql
~/pg.dump-split/public/FUNCTION/barfoo-2.sql
~/pg.dump-split/public/FUNCTION/barfoo-3.sql

I think you are right about functions (and aggregates) being the only
desc-type where two objects can share the same name in the same schema.
This means the problem of dumping objects in different order is a very
limited problem, only affecting overloaded functions.

I didn't include the arguments in the file name, as it would lead to very
long file names unless truncated, and since the problem is very limited, I
think we shouldn't include it. It's cleaner with just the name part of the
tag in the file name.

--
Best regards,

Joel Jacobson
Glue Finance

Attachments:

pg-dump-split-plain-text-files-8.4.6.patchapplication/octet-stream; name=pg-dump-split-plain-text-files-8.4.6.patchDownload+103-0
pg-dump-split-plain-text-files-9.1devel.patchapplication/octet-stream; name=pg-dump-split-plain-text-files-9.1devel.patchDownload+103-0
#11Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Joel Jacobson (#10)
Re: pg_dump --split patch

On Tue, Dec 28, 2010 at 2:39 PM, Joel Jacobson <joel@gluefinance.com> wrote:

2010/12/28 Gurjeet Singh <singh.gurjeet@gmail.com>

I would suggest the directory structure as:

/crypt/pg.dump-split/schema-name-1/VIEWS/view-name-1.sql
/crypt/pg.dump-split/schema-name-1/TABLES/table-name-1.sql
...
/crypt/pg.dump-split/schema-name-2/VIEWS/view-name-1.sql
/crypt/pg.dump-split/schema-name-2/TABLES/table-name-1.sql

This might n be more amenable to diff'ing the different dumps. Schemas are
logical grouping of other objects and hence making that apparent in your
dump's hierarchy makes more sense.

Thanks Gurjeet and Tom for good feedback!

I've made some changes and attached new patches.
Looks much better now I think!

This is what I've changed,

*) Not using oid anymore in the filename
*) New filename/path structure: [-f
filename]-split/[schema]/[desc]/[tag].sql
*) If two objects share the same name tag for the same [schema]/[desc], -2,
-3, etc is appended to the name. Example:
~/pg.dump-split/public/FUNCTION/foobar.sql
~/pg.dump-split/public/FUNCTION/foobar-2.sql
~/pg.dump-split/public/FUNCTION/barfoo.sql
~/pg.dump-split/public/FUNCTION/barfoo-2.sql
~/pg.dump-split/public/FUNCTION/barfoo-3.sql

I think you are right about functions (and aggregates) being the only
desc-type where two objects can share the same name in the same schema.
This means the problem of dumping objects in different order is a very
limited problem, only affecting overloaded functions.

I didn't include the arguments in the file name, as it would lead to very
long file names unless truncated, and since the problem is very limited, I
think we shouldn't include it. It's cleaner with just the name part of the
tag in the file name.

I haven't seen your code yet, but we need to make sure that in case of name
collision we emit the object definitions in a sorted order so that the dump
is always deterministic: func1(char) should be _always_ dumped before
func1(int), that is, output file names are always deterministic.

The problem I see with suffixing a sequence id to the objects with name
collision is that one day the dump may name myfunc(int) as myfunc.sql and
after an overloaded version is created, say myfunc(char, int), then the same
myfunc(int) may be dumped in myfunc-2.sql, which again is non-deterministic.

Also, it is a project policy that we do not introduce new features in back
branches, so spending time on an 8.4.6 patch may not be the best use of your
time.

Regards,
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device

#12Joel Jacobson
joel@gluefinance.com
In reply to: Gurjeet Singh (#11)
Re: pg_dump --split patch

Sent from my iPhone

On 28 dec 2010, at 21:45, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:

The problem I see with suffixing a sequence id to the objects with name
collision is that one day the dump may name myfunc(int) as myfunc.sql and
after an overloaded version is created, say myfunc(char, int), then the same
myfunc(int) may be dumped in myfunc-2.sql, which again is non-deterministic.

I agree, good point!
Perhaps abbreviations are to prefer, e.g., myfunc_i, myfunc_i_c, etc to
reduce the need of truncating filenames.

Also, it is a project policy that we do not introduce new features in back
branches, so spending time on an 8.4.6 patch may not be the best use of your
time.

My company is using 8.4 and needs this feature, so I'll have to patch it
anyway :)

Regards,
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#12)
Re: pg_dump --split patch

On 12/28/2010 04:44 PM, Joel Jacobson wrote:

The problem I see with suffixing a sequence id to the objects with
name collision is that one day the dump may name myfunc(int) as
myfunc.sql and after an overloaded version is created, say
myfunc(char, int), then the same myfunc(int) may be dumped in
myfunc-2.sql, which again is non-deterministic.

I agree, good point!
Perhaps abbreviations are to prefer, e.g., myfunc_i, myfunc_i_c, etc
to reduce the need of truncating filenames.

I think that's just horrible. Does the i stand for integer or inet? And
it will get *really* ugly for type names with spaces in them ...

cheers

andrew

#14Joel Jacobson
joel@gluefinance.com
In reply to: Andrew Dunstan (#13)
Re: pg_dump --split patch

2010/12/28 Andrew Dunstan <andrew@dunslane.net>

I think that's just horrible. Does the i stand for integer or inet? And it
will get *really* ugly for type names with spaces in them ...

True, true.

But while "c" is too short, I think "character varying" is too long. Is
there some convenient lookup table to convert between the long names to the
short names?
E.g.,
character varying => varchar
timestamp with time zone => timestamptz
etc.

--
Best regards,

Joel Jacobson
Glue Finance

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#13)
Re: pg_dump --split patch

Andrew Dunstan <andrew@dunslane.net> writes:

On 12/28/2010 04:44 PM, Joel Jacobson wrote:

Perhaps abbreviations are to prefer, e.g., myfunc_i, myfunc_i_c, etc
to reduce the need of truncating filenames.

I think that's just horrible. Does the i stand for integer or inet? And
it will get *really* ugly for type names with spaces in them ...

You think spaces are bad, try slashes ;-)

Not to mention the need for including schemas in typenames sometimes.
I think you're going to have a real problem trying to fully describe a
function's signature in a file name of reasonable max length.

regards, tom lane

#16Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Andrew Dunstan (#13)
Re: pg_dump --split patch

On Tue, Dec 28, 2010 at 4:57 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

On 12/28/2010 04:44 PM, Joel Jacobson wrote:

The problem I see with suffixing a sequence id to the objects with name

collision is that one day the dump may name myfunc(int) as myfunc.sql and
after an overloaded version is created, say myfunc(char, int), then the same
myfunc(int) may be dumped in myfunc-2.sql, which again is non-deterministic.

I agree, good point!
Perhaps abbreviations are to prefer, e.g., myfunc_i, myfunc_i_c, etc to
reduce the need of truncating filenames.

I think that's just horrible. Does the i stand for integer or inet? And it
will get *really* ugly for type names with spaces in them ...

Do you mean using data type names in filename is a bad idea, or is
abbreviating the type names is a bad idea?

Maybe we can compute a hash based on the type names and use that in the
file's name?

Regards,
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device

#17David Wilson
david.t.wilson@gmail.com
In reply to: Joel Jacobson (#10)
Re: pg_dump --split patch

On Tue, Dec 28, 2010 at 2:39 PM, Joel Jacobson <joel@gluefinance.com> wrote:

I think you are right about functions (and aggregates) being the only
desc-type where two objects can share the same name in the same schema.
This means the problem of dumping objects in different order is a very
limited problem, only affecting overloaded functions.

I didn't include the arguments in the file name, as it would lead to very
long file names unless truncated, and since the problem is very limited, I
think we shouldn't include it. It's cleaner with just the name part of the
tag in the file name.

Why not place all overloads of a function within the same file? Then,
assuming you order them deterministically within that file, we sidestep the
file naming issue and maintain useful diff capabilities, since a diff of the
function's file will show additions or removals of various overloaded
versions.

--
- David T. Wilson
david.t.wilson@gmail.com

#18Joel Jacobson
joel@gluefinance.com
In reply to: David Wilson (#17)
Re: pg_dump --split patch

2010/12/29 David Wilson <david.t.wilson@gmail.com>

Why not place all overloads of a function within the same file? Then,
assuming you order them deterministically within that file, we sidestep the
file naming issue and maintain useful diff capabilities, since a diff of the
function's file will show additions or removals of various overloaded
versions.

Good suggestion. I agree, trying to put variations of the same function in
different files simply becomes too ugly and the problem it solves is not big
enough.
Then we just need to make sure pg_dump dumps objects in the same order, and
let diff take care of the rest of the challenge. :)

Brb with a new patch.

--
Best regards,

Joel Jacobson
Glue Finance

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Wilson (#17)
Re: pg_dump --split patch

David Wilson <david.t.wilson@gmail.com> writes:

On Tue, Dec 28, 2010 at 2:39 PM, Joel Jacobson <joel@gluefinance.com> wrote:

I didn't include the arguments in the file name, as it would lead to very
long file names unless truncated, and since the problem is very limited, I
think we shouldn't include it. It's cleaner with just the name part of the
tag in the file name.

Why not place all overloads of a function within the same file? Then,
assuming you order them deterministically within that file, we sidestep the
file naming issue and maintain useful diff capabilities, since a diff of the
function's file will show additions or removals of various overloaded
versions.

If you've solved the deterministic-ordering problem, then this entire
patch is quite useless. You can just run a normal dump and diff it.

regards, tom lane

#20Joel Jacobson
joel@gluefinance.com
In reply to: Tom Lane (#19)
Re: pg_dump --split patch

2010/12/29 Tom Lane <tgl@sss.pgh.pa.us>

If you've solved the deterministic-ordering problem, then this entire
patch is quite useless. You can just run a normal dump and diff it.

No, that's only half true.

Diff will do a good job minimizing the "size" of the diff output, yes, but
such a diff is still quite useless if you want to quickly grasp the context
of the change.

If you have a hundreds of functions, just looking at the changed source code
is not enough to figure out which functions were modified, unless you have
the brain power to memorize every single line of code and are able to figure
out the function name just by looking at the old and new line of codes.

To understand a change to my database functions, I would start by looking at
the top-level, only focusing on the names of the functions
modified/added/removed.
At this stage, you want as little information as possible about each change,
such as only the names of the functions.
To do this, get a list of changes functions, you cannot compare two full
schema plain text dumps using diff, as it would only reveal the lines
changed, not the name of the functions, unless you are lucky to get the name
of the function within the (by default) 3 lines of copied context.

While you could increase the number of copied lines of context to a value
which would ensure you would see the name of the function in the diff, that
is not feasible if you want to quickly "get a picture" of the code areas
modified, since you would then need to read through even more lines of diff
output.

For a less database-centric system where you don't have hundreds of stored
procedures, I would agree it's not an issue to keep track of changes by
diffing entire schema files, but for extremely database-centric systems,
such as the one we have developed at my company, it's not possible to "get
the whole picture" of a change by analyzing diffs of entire schema dumps.

The patch has been updated:

*) Only spit objects with a namespace (schema) not being null
*) Append all objects of same tag (name) of same type (desc) of same
namespace (schema) to the same file (i.e., do not append -2, -3, like
before) (Suggested by David Wilson, thanks.)

I also tested to play around with "ORDER BY pronargs" and "ORDER BY pronargs
DESC" to the queries in getFuncs() in pg_dump.c, but it had no effect to the
order the functions of same name but different number of arguments were
dumped.
Perhaps functions are already sorted?
Anyway, it doesn't matter that much, keeping all functions of the same name
in the same file is a fair trade-off I think. The main advantage is the
ability to quickly get a picture of the names of all changed functions,
secondly to optimize the actual diff output.

--
Best regards,

Joel Jacobson
Glue Finance

E: jj@gluefinance.com
T: +46 70 360 38 01

Postal address:
Glue Finance AB
Box 549
114 11 Stockholm
Sweden

Visiting address:
Glue Finance AB
Birger Jarlsgatan 14
114 34 Stockholm
Sweden

Attachments:

pg-dump-split-plain-text-files-9.1devel.patchapplication/octet-stream; name=pg-dump-split-plain-text-files-9.1devel.patchDownload+90-0
pg-dump-split-plain-text-files-9.1alpha2.patchapplication/octet-stream; name=pg-dump-split-plain-text-files-9.1alpha2.patchDownload+90-0
pg-dump-split-plain-text-files-8.4.6.patchapplication/octet-stream; name=pg-dump-split-plain-text-files-8.4.6.patchDownload+90-0
#21Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#20)
#22Joel Jacobson
joel@gluefinance.com
In reply to: Andrew Dunstan (#21)
#23Aidan Van Dyk
aidan@highrise.ca
In reply to: Joel Jacobson (#22)
#24Joel Jacobson
joel@gluefinance.com
In reply to: Aidan Van Dyk (#23)
#25Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Joel Jacobson (#24)
#26Aidan Van Dyk
aidan@highrise.ca
In reply to: Gurjeet Singh (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Aidan Van Dyk (#26)
#28Joel Jacobson
joel@gluefinance.com
In reply to: Tom Lane (#27)
#29Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Joel Jacobson (#1)
#30Joel Jacobson
joel@gluefinance.com
In reply to: Dimitri Fontaine (#29)
#31Robert Treat
xzilla@users.sourceforge.net
In reply to: Joel Jacobson (#30)
#32Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#8)
#33Peter Eisentraut
peter_e@gmx.net
In reply to: Andrew Dunstan (#21)
#34Dmitry Koterov
dmitry@koterov.ru
In reply to: Joel Jacobson (#28)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Dmitry Koterov (#34)
#36Joel Jacobson
joel@gluefinance.com
In reply to: Robert Haas (#35)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#35)
#38Joel Jacobson
joel@gluefinance.com
In reply to: Tom Lane (#37)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#37)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#39)
#41Joel Jacobson
joel@gluefinance.com
In reply to: Tom Lane (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Joel Jacobson (#41)
#43Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Robert Haas (#39)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Dimitri Fontaine (#43)
#45Hannu Krosing
hannu@tm.ee
In reply to: Joel Jacobson (#1)
#46Hannu Krosing
hannu@tm.ee
In reply to: Joel Jacobson (#12)
#47Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#15)
#48Greg Smith
gsmith@gregsmith.com
In reply to: Joel Jacobson (#20)
#49Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#40)
#50Marko Tiikkaja
marko@joh.to
In reply to: Joel Jacobson (#1)
#51Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Marko Tiikkaja (#50)
#52Marko Tiikkaja
marko@joh.to
In reply to: Dimitri Fontaine (#51)
#53Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Marko Tiikkaja (#52)
#54Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Dimitri Fontaine (#53)
#55Andrew Dunstan
andrew@dunslane.net
In reply to: Dimitri Fontaine (#54)
#56Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Marko Tiikkaja (#52)