Schema version management

Started by Joel Jacobson, 76 messages, pgsql-hackers
#1Joel Jacobson
joel@trustly.com

Hi,

I just read a very interesting post about "schema version management".

Quote: "You could set it up so that every developer gets their own
test database, sets up the schema there, takes a dump, and checks that
in. There are going to be problems with that, including that dumps
produced by pg_dump are ugly and optimized for restoring, not for
developing with, and they don't have a deterministic output order." (
http://petereisentraut.blogspot.com/2012/05/my-anti-take-on-database-schema-version.html
)

Back in December 2010, I suggested a new option to pg_dump, --split,
which would write the schema definition of each object in separate
files:

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02285.php

Instead of one huge plain-text schema file, which is impossible to
version control sensibly, all tables/sequences/views/functions are
written to separate files, allowing a version control system, such as
git, to track them properly.

The "deterministic output order" problem mentioned in the post above
is not a problem if each object (table/sequence/view/function/etc.) is
written to the same filename every time.
No matter the order in which the objects are dumped, the tree of files
and their contents will be identical.

I remember a lot of hackers were very positive about this option, but
we somehow failed to agree on the naming of files in the tree
structure. I'm sure we can work that out though.

I use this feature in production, I have a cronjob which does a dump
of the schema every hour, committing any eventual changes to a
separate git branch for each database installation, such as
production, development and test.
If no changes to the schema have been made, nothing will be committed
to git since none of the files have changed.

It is then drop-dead simple to diff two different branches of the
database schema, such as development or production, or diffing
different revisions allowing point-in-time comparison of the schema.
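The whole job fits in a few lines of shell. A minimal sketch, where the database name, branch name, and repository path are examples, and --split is the patch's proposed pg_dump flag, not a released one:

```shell
# Hourly schema snapshot: dump with the proposed --split option and commit
# any changes to a per-installation git branch. Names and paths below are
# illustrative only.
snapshot_schema() {
  db=$1; branch=$2; repo=$3
  cd "$repo" || return 1
  git checkout -q "$branch"
  # --split (proposed) writes one file per table/sequence/view/function
  pg_dump --schema-only --split -f "$repo/schema.sql" "$db"
  git add -A
  # commit only if the schema actually changed since the last run
  git diff --cached --quiet || git commit -q -m "schema snapshot $(date)"
}
# crontab entry (hourly), assuming the function is wrapped in a script:
# 0 * * * * /usr/local/bin/snapshot-schema gluepay production /var/lib/schema-repo
```

Comparing two installations is then just `git diff production development`, and point-in-time comparison is `git diff <rev1> <rev2>`.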

This is an example of the output of git log --summary for one of the
automatic commits to our production database's git repo:

--
commit 18c31f8162d851b0dac3bad7e80529ef2ed18be3
Author: Production Database <production.database@trustly.com>
Date:   Fri May 4 15:00:04 2012 +0200

    Update of database schema Linux DB0 2.6.26-2-amd64 #1 SMP Wed Aug
19 22:33:18 UTC 2009 x86_64 GNU/Linux Fri, 04 May 2012 15:00:04 +0200

 create mode 100644 gluepay-split/public/CONSTRAINT/openingclosingbalances_pkey.sql
 create mode 100644 gluepay-split/public/CONSTRAINT/openingclosingbalances_source_key.sql
 create mode 100644 gluepay-split/public/SEQUENCE/seqopeningclosingbalance.sql
 create mode 100644 gluepay-split/public/TABLE/openingclosingbalances.sql
--

Here we can see we apparently deployed a new table,
"openingclosingbalances" around Fri May 4 15:00:04.

Without any manual work, I'm able to follow all changes actually
_deployed_ in each database.

At my company, a highly database-centric stored-procedure intensive
business dealing with mission-critical monetary transactions, we've
been using this technique to successfully do schema version management
without any hassle for the last two years.

Hopefully this can add to the list of various possible _useful_ schema
version management methods.

Best regards,

Joel Jacobson

#2Daniel Farina
daniel@heroku.com
In reply to: Joel Jacobson (#1)
Re: Schema version management

On Sun, May 20, 2012 at 12:41 PM, Joel Jacobson <joel@trustly.com> wrote:

Hi,

I just read a very interesting post about "schema version management".

Quote: "You could set it up so that every developer gets their own
test database, sets up the schema there, takes a dump, and checks that
in. There are going to be problems with that, including that dumps
produced by pg_dump are ugly and optimized for restoring, not for
developing with, and they don't have a deterministic output order." (
http://petereisentraut.blogspot.com/2012/05/my-anti-take-on-database-schema-version.html
)

I think you are absolutely right, but I'm not sure if teaching pg_dump
a new option is the best idea. It's a pretty complex program as-is.
I've also heard some people who really wish pg knew how to self-dump
for valid reasons.

It sounds like some of the catalog wrangling and cycle-breaking
properties of pg_dump could benefit from being exposed stand-alone,
but unfortunately that's not a simple task, especially if you want to
do The Right Thing and have pg_dump link that code, given pg_dump's
criticality.

pg_extractor is a new/alternative take on the database copying
problem, maybe you could have a look at that?

--
fdr

#3Joel Jacobson
joel@trustly.com
In reply to: Daniel Farina (#2)
Re: Schema version management

On Mon, May 21, 2012 at 8:08 AM, Daniel Farina <daniel@heroku.com> wrote:

I think you are absolutely right, but I'm not sure if teaching pg_dump
a new option is the best idea.  It's a pretty complex program as-is.
I've also heard some people who really wish pg knew how to self-dump
for valid reasons.

Complex program? Yes, pg_dump is extremely complex; I wouldn't want
to touch any of its existing code. A rewrite is probably close to impossible.

Complex patch? No. It's 102 lines of code and doesn't change any of
the existing code in pg_dump, it simply adds some lines writing out
the objects to separate files. Have a look at the patch, it's super
simple.

It sounds like some of the catalog wrangling and cycle-breaking
properties of pg_dump could benefit from being exposed stand-alone,
but unfortunately that's not a simple task, especially if you want to
do The Right Thing and have pg_dump link that code, given pg_dump's
criticality.

I agree it's not a simple task, and it's probably not something anyone
will fix in the near future.
The --split option doesn't aim to solve this problem either. That's a
different problem, and it's not a problem I have.

pg_extractor is a new/alternative take on the database copying
problem, maybe you could have a look at that?

It's just sad realizing people need to come up with hacks and
workarounds to solve an obvious real-life problem, easily fixed inside
pg_dump with 102 lines of drop-dead simple code, not touching any of
the logic or flow in pg_dump.

I can't even imagine how many hours coders have wasted hacking together
tools like pg_extractor just to circumvent the stupid fact that pg_dump
can't do this natively.

pg_extractor is way more complex than my suggested patch: it's 974
lines of Perl code, as opposed to 102 lines of simple code in the
patch.
The pg_extractor also does a lot more than simply splitting objects
into separate files, like executing svn commands.

The splitting of objects into separate files should clearly be the
responsibility of pg_dump.
It would allow you to easily version control the schema files yourself
with any version control system, such as svn, git, etc.

I'm sure pg_extractor does its best to achieve the objective, but even
if it does, I would never trust it for production usage; version
controlling your production schema is far too important to trust to any
tool that is not part of the mainline distribution of Postgres. And personally
I don't have any problem: I've been using the --split option for two
years. I just feel sorry for the rest of the Postgres community,
unaware of how to solve this problem, having to hack together their
own little tools, or be "lucky" enough to find some existing hack.

#4Daniel Farina
daniel@heroku.com
In reply to: Joel Jacobson (#3)
Re: Schema version management

On Sun, May 20, 2012 at 7:36 PM, Joel Jacobson <joel@trustly.com> wrote:

On Mon, May 21, 2012 at 8:08 AM, Daniel Farina <daniel@heroku.com> wrote:

I think you are absolutely right, but I'm not sure if teaching pg_dump
a new option is the best idea.  It's a pretty complex program as-is.
I've also heard some people who really wish pg knew how to self-dump
for valid reasons.

Complex program? Yes, pg_dump it is extremely complex, I wouldn't want
to touch any of the code. A rewrite is probably close to impossible.

I wouldn't be so sure about that...

Complex patch? No. It's 102 lines of code and doesn't change any of
the existing code in pg_dump, it simply adds some lines writing out
the objects to separate files. Have a look at the patch, it's super
simple.

Ah. I did not know there was a patch already out there -- somehow I did
not get that from your message. That helps, as the patch can then be
audited in its precise functionality.

It sounds like some of the catalog wrangling and cycle-breaking
properties of pg_dump could benefit from being exposed stand-alone,
but unfortunately that's not a simple task, especially if you want to
do The Right Thing and have pg_dump link that code, given pg_dump's
criticality.

It's just sad realizing people need to come up with hacks and
workarounds to solve an obvious real-life problem, easily fixed inside
pg_dump with 102 lines of drop-dead simple code, not touching any of
the logic or flow in pg_dump.

I can't even imagine how many hours coders have wasted hacking together
tools like pg_extractor just to circumvent the stupid fact that pg_dump
can't do this natively.

My next question would be how this might relate to the directory dump
format. For example, is it an embellishment of that? At first glance it
seems that this patch might be a cousin of that feature. Or is it
superseded? The documentation is clear that tables are given their own
files, but doesn't say much about how other schema objects are stored,
so it may or may not be useful for your needs.

Also, now that I look more carefully, there was a lot of conversation
about this patch; it seems like what you are doing now is reporting
its successful use, and I did not understand that by reading the
abstract of your email. And, beyond that, do we have a summary of the
open questions that prevented it from being committed?

I'm sure pg_extractor does it best to achieve the objective, but even
if it does, I would never trust it for production usage, version
controlling your production schema is far too important to trust any
tool not part of the mainline distribution of postgres. And personally
I don't have any problem, I've been using the --split option for two
years, I just feel sorry for the rest of the postgres community,
unaware of how to solve this problem, having to hack together their
own little tools, or be "lucky" finding some existing hack.

My thinking is that confidence would be increased if there was a piece
of code that handled a lot of the catalog munging et al that is part
of pg_dump that *is* maintained by postgres so other projects can more
convincingly add a correct veneer.

As a meta-comment, all I did was ask some polite questions. You could
have politely disqualified pg_extractor and spared some of the
language without having gotten anything less done.

--
fdr

#5Joel Jacobson
joel@trustly.com
In reply to: Daniel Farina (#4)
Re: Schema version management

On Mon, May 21, 2012 at 10:06 AM, Daniel Farina <daniel@heroku.com> wrote:

Also, now that I look more carefully, there was a lot of conversation
about this patch; it seems like what you are doing now is reporting
its successful use, and I did not understand that by reading the
abstract of your email.  And, beyond that, do we have a summary of the
open questions that prevented it from being committed?

Good idea. Here is an attempt at a summary:

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02301.php
The initial feedback was on the usage of OIDs as file names.
This was indeed a bad idea and was changed, see
http://archives.postgresql.org/pgsql-hackers/2010-12/msg02314.php
Tom Lane also pointed out it doesn't solve the "randomly different
ordering of rows within a table"-problem.
The rows within a table are not part of the schema. The patch doesn't
attempt to solve that problem.

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02318.php
Gurjeet Singh pointed out the problem with functions sharing the same
name but having different arguments.
As of now, it's not certain they will always be dumped into the same files.
This is a valid point, and needs to be solved in an elegant way.
The arguments need to be made part of the path somehow.

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02327.php
Another idea Gurjeet had was to hash the object identifier and use
that in the file's name.
Not a bad idea; it would look nicer when functions have very many arguments.
Perhaps /[schema]/[object type]/[name]/[hash of arguments].sql
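A sketch of what the hash idea could look like; the layout and 8-character truncation are my own illustration, not anything the patch actually does:

```shell
# Hash the identity argument list so every overloaded variant of a
# function gets a short, filesystem-safe, deterministic filename.
# The truncation length and path layout are illustrative choices.
args='integer, timestamp with time zone'
h=$(printf '%s' "$args" | md5sum | cut -c1-8)
echo "public/FUNCTION/myfunc/$h.sql"
```

The hash is stable across dumps because it depends only on the argument list, at the cost of the filename no longer being human-readable.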

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02329.php
David Wilson suggested placing all overloaded functions within the same file.
"Then, assuming you order them deterministically within that file, we
sidestep the
file naming issue and maintain useful diff capabilities, since a diff of the
function's file will show additions or removals of various overloaded
versions."
This would be a good idea, but it falls down on pg_dump not outputting
the functions in a deterministic order.

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02496.php
Robert Treat: "I've both enjoyed reading this thread and seeing this wheel
reinvented yet again, and wholeheartedly +1 the idea of building this
directly into pg_dump. (The only thing better would be to make everything
sql callable, but that's a problem for another day)."

My thinking is that confidence would be increased if there was a piece
of code that handled a lot of the catalog munging et al that is part
of pg_dump that *is* maintained by postgres so other projects can more
convincingly add a correct veneer.

I totally agree, the most easy place to handle it is within pg_dump.

As a meta-comment, all I did was ask some polite questions.  You could
have politely disqualified pg_extractor and spared some of the
language without having gotten anything less done.

I very much appreciated your answer, and I in no way meant to be impolite.

Best regards,

Joel

#6Benedikt Grundmann
bgrundmann@janestreet.com
In reply to: Joel Jacobson (#5)
Re: Schema version management

On Mon, May 21, 2012 at 5:03 AM, Joel Jacobson <joel@trustly.com> wrote:

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02301.php
The initial feedback was on the usage of OIDs as file names.
This was indeed a bad idea and was changed, see
http://archives.postgresql.org/pgsql-hackers/2010-12/msg02318.php
Gurjeet Singh pointed out the problem with functions sharing the same
name but having different arguments.
As of now, it's not certain they will always be dumped into the same files.
This is a valid point, and needs to be solved in an elegant way.
The arguments needs to be made part of the path somehow.

This is interesting. At Jane Street we actually have a small tool
that "parses" the output of pg_dump (well, applies a set of regular
expressions plus a little bit of guesswork). We use this to do three things,
all of which I would love to see supported by the postgres tool chain proper:

1) split the output into one file per thing (basically as per this
thread), each file named <type>_<name>_<running-integer>, for use
with a VCS. So if we have an overloaded function foo we end up with
several files: function_foo_1.sql, function_foo_2.sql, ... The order of the
enumeration is just the order the functions occurred in the pg_dump
output, which seems to be stable and therefore good enough.

2) extract a patch. You give the tool the name of one or more roots
(e.g. a table or set of tables you want to modify). It finds all
things that depend on them (well, sort of: turn the body of each
definition into a list of words; a depends on b if the name of b
occurs in a). Do a topological sort (if there are cycles because
of the hacky dependency check, break them but continue and produce a
warning). Output a file that first drops the definitions in inverse
dependency order and then recreates them (in dependency order).
The file starts with a BEGIN but does NOT end with a COMMIT, so you
are forced to enter it yourself.

This tool is fantastic if you have a big set of plpgsql functions as
it is otherwise hard to make sure that you have modified all places
when refactoring, changing a column, etc...

3) Find all leaves. E.g. do the topsort on the whole pg_dump and list
the names of all things nothing depends on. This is mostly useful if
you want to make sure that you are not accumulating cruft that isn't
used by anything. Of course you separately need a list or knowledge
about the entry points of your application(s).
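Point 1 above (and the --split idea earlier in the thread) can be approximated today by splitting a plain-text pg_dump on the `-- Name: ...; Type: ...; Schema: ...` comment headers it emits before each object. A rough sketch, run here against a tiny hand-written sample instead of a real dump; the parsing is deliberately naive:

```shell
# Split a plain pg_dump into one file per object, keyed on the comment
# headers pg_dump writes before each object. The sample input stands in
# for a real dump.
cat > dump.sql <<'EOF'
--
-- Name: orderstatuses; Type: TABLE; Schema: public; Owner: joel
--
CREATE TABLE orderstatuses (orderstatusid int PRIMARY KEY, name text);
--
-- Name: check_name(text); Type: FUNCTION; Schema: aml; Owner: joel
--
CREATE FUNCTION aml.check_name(text) RETURNS boolean
    AS $$ SELECT true $$ LANGUAGE sql;
EOF

awk '
  /^-- Name: / {
    split($0, p, "; ")
    name = substr(p[1], 10); sub(/\(.*/, "", name)  # drops the argument list
    type = substr(p[2], 7)                          # e.g. TABLE, FUNCTION
    schema = substr(p[3], 9)
    dir = "schema-split/" schema "/" type
    system("mkdir -p \"" dir "\"")
    out = dir "/" name ".sql"
  }
  out { print > out }
' dump.sql
```

Note that dropping the argument list is exactly the overloaded-function ambiguity discussed elsewhere in the thread: two overloads of the same name would land in the same file.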

Cheers,

Bene

#7Daniel Farina
daniel@heroku.com
In reply to: Joel Jacobson (#5)
Re: Schema version management

On Sun, May 20, 2012 at 9:03 PM, Joel Jacobson <joel@trustly.com> wrote:

On Mon, May 21, 2012 at 10:06 AM, Daniel Farina <daniel@heroku.com> wrote:

Also, now that I look more carefully, there was a lot of conversation
about this patch; it seems like what you are doing now is reporting
its successful use, and I did not understand that by reading the
abstract of your email.  And, beyond that, do we have a summary of the
open questions that prevented it from being committed?

Good idea. Here is an attempt to a summary:

Thank you, that's very informative. I'd like to reiterate one
question, though, which is something like:

"How do you feel that the since-committed directory-output/input
support in pg_dump/pg_restore could or should influence your patch, if
at all?"

It seems like now that there is support for spitting out a bunch of
files in a directory for pg_dump that is now going to be supported for
a long time that a new feature like yours might be more cohesive if it
somehow played with that. I must confess I haven't read the patch in
detail, especially if it has been updated, but back then there was no
multi-file output mode from pg_dump, and now there is one. My naive
understanding is this would be adding a second one as-is, but I wonder
if that is strictly necessary to fulfill the use case.

--
fdr

#8Joel Jacobson
joel@trustly.com
In reply to: Daniel Farina (#7)
Re: Schema version management

On Tue, May 22, 2012 at 3:30 AM, Daniel Farina <daniel@heroku.com> wrote:

Thank you, that's very informative.  I'd like to reiterate one
question, though, which is something like:

"How do you feel that the since-committed directory-output/input
support in pg_dump/pg_restore could or should influence your patch, if
at all?"

The directory format fulfills a different purpose. The tables are
split into files, where each file name gets a number. Functions are
not split into files, they are defined in the table of content file,
toc.dat.

Example:

joel@Joel-Jacobsons-MacBook-Pro ~ $ pg_dump -F d -f /Users/joel/test
joel@Joel-Jacobsons-MacBook-Pro ~ $ ls -la test
total 24
drwx------ 5 joel staff 170 May 22 07:16 .
drwx------+ 130 joel staff 4488 May 22 07:16 ..
-rw-r--r-- 1 joel staff 38 May 22 07:16 2116.dat.gz
-rw-r--r-- 1 joel staff 39 May 22 07:16 2117.dat.gz
-rw-r--r-- 1 joel staff 2265 May 22 07:16 toc.dat

This is a good feature for its purpose, but doesn't provide a solution
for the schema version management problem.

It seems like now that there is support for spitting out a bunch of
files in a directory for pg_dump that is now going to be supported for
a long time that a new feature like yours might be more cohesive if it
somehow played with that.  I must confess I haven't read the patch in
detail, especially if it has been updated, but back then there was no
multi-file output mode from pg_dump, and now there is one.  My naive
understanding is this would be adding a second one as-is, but I wonder
if that is strictly necessary to fulfill the use case.

If one wants to reuse the splitting-into-files code of the directory
format, maybe the existing option -F d could be tweaked to output
both a machine-readable format (the current way) and a
human-friendly tree of files and content (as suggested by my patch).

I wonder what the option would be called then; having two-character
options is not an option, I guess. Maybe -F t for "tree" instead of
"directory", as the -F d option only dumps to a single directory and
not a tree structure?

#9Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#8)
Re: Schema version management

On 05/21/2012 08:25 PM, Joel Jacobson wrote:

On Tue, May 22, 2012 at 3:30 AM, Daniel Farina<daniel@heroku.com> wrote:

Thank you, that's very informative. I'd like to reiterate one
question, though, which is something like:

"How do you feel that the since-committed directory-output/input
support in pg_dump/pg_restore could or should influence your patch, if
at all?"

The directory format fulfills a different purpose. The tables are
split into files, where each file name gets a number. Functions are
not split into files, they are defined in the table of content file,
toc.dat.

Example:

joel@Joel-Jacobsons-MacBook-Pro ~ $ pg_dump -F d -f /Users/joel/test
joel@Joel-Jacobsons-MacBook-Pro ~ $ ls -la test
total 24
drwx------ 5 joel staff 170 May 22 07:16 .
drwx------+ 130 joel staff 4488 May 22 07:16 ..
-rw-r--r-- 1 joel staff 38 May 22 07:16 2116.dat.gz
-rw-r--r-- 1 joel staff 39 May 22 07:16 2117.dat.gz
-rw-r--r-- 1 joel staff 2265 May 22 07:16 toc.dat

This is a good feature for its purpose, but doesn't provide a solution
for the schema version management problem.

It seems like now that there is support for spitting out a bunch of
files in a directory for pg_dump that is now going to be supported for
a long time that a new feature like yours might be more cohesive if it
somehow played with that. I must confess I haven't read the patch in
detail, especially if it has been updated, but back then there was no
multi-file output mode from pg_dump, and now there is one. My naive
understanding is this would be adding a second one as-is, but I wonder
if that is strictly necessary to fulfill the use case.

If one wants to reuse the splitting-into-files code of the directory
format, maybe the existing option -F d could be tweaked to output
both a machine-readable format (the current way) and a
human-friendly tree of files and content (as suggested by my patch).

I wonder what the option would be called then; having two-character
options is not an option, I guess. Maybe -F t for "tree" instead of
"directory", as the -F d option only dumps to a single directory and
not a tree structure?

I have a little utility to write out the data in the TOC as separate
files. It works with both a custom format dump, and the toc.dat file
from a directory format or unpacked tar format dump.

It was originally written as a debugging aid, but can be used to some
extent for schema management as well. It's far from complete, but still
might be useful. See <https://github.com/adunstan/DumpToc>

cheers

andrew

#10Daniel Farina
daniel@heroku.com
In reply to: Joel Jacobson (#8)
Re: Schema version management

On Mon, May 21, 2012 at 5:25 PM, Joel Jacobson <joel@trustly.com> wrote:

If one want to reuse the splitting to files-code of the directory
format, maybe the existing option -F d could be tweaked to output in
both a a machine-readable format (current way), and also a
human-friendly tree of files and content (like suggested by my patch).

I wonder what the option would be called then, having two chars
options is not an option I guess, maybe -F t for "tree" instead of
"directory", as the -F d option only dumps to a single directory and
not a tree-structure?

Is there a reason why the current directory format could not be
adjusted to become more human-friendly, rather than organized for
mechanical reasons? I realize there is a backwards-compatibility
problem, but it may be better than bloating on a new option.

Andrew's approach of reading the TOC may also be good, so that pg_dump
can avoid serving the additional masters of schema versioning and
development usability in addition to dumping. The TOC is the closest
thing we have to a library-ification of pg_dump in the near term.
But I don't see how making the directory output format more
human-friendly could be seen as a bad thing overall, except on the
notable axis of implementation complexity. Silly issues like naming
files on different platforms, case sensitivity, and file-length
restrictions may rear their ugly heads.

I think about this because, in addition to the data types and operators
defined in the development process, there are often small tables that
need to be loaded with content and version controlled as well, rather
like userland equivalents of pg_enum entries.

--
fdr

#11Joel Jacobson
joel@trustly.com
In reply to: Daniel Farina (#10)
Re: Schema version management

On Wed, May 23, 2012 at 3:24 AM, Daniel Farina <daniel@heroku.com> wrote:

Is there a reason why the current directory format could not be
adjusted to become more human-readable friendly for mechanical
reasons?  I realize there is a backwards compatibility problem, but it
may be better than bloating a new option.

I like your idea, then the format would be directory, while the option
would be something like, --human-friendly?

Currently, the directory format only dumps the data of tables
into separate files; everything else goes into the toc.dat file.
To make this work, the objects written to the toc.dat file would
also have to be written to separate files.

But I don't see how making the directory output format more
human-friendly could be seen as a bad thing overall, except in the
notable axis of implementation complexity.  Silly issues like naming
files on different platforms, case sensitivity, and file length
restrictions may rear their ugly head.

If the function's entire identity arguments were included in the filename,
two dumps of the same schema in two different databases
would be guaranteed to produce the same dump.

This would render some very long filenames for functions with many arguments,
but this problem could at least be reduced by using the shorter alias for each
data type, such as "varchar" instead of "character varying" and "timestamptz"
instead of "timestamp with time zone", etc.

http://www.postgresql.org/docs/devel/static/datatype.html#DATATYPE-TABLE

Also, to save even more space, since the name of the function can be long too,
the function name could be made a directory, with the different overloaded
signatures as different files, e.g.:

/public/FUNCTION/myfunc/int.sql
/public/FUNCTION/myfunc/int_timestamptz.sql

And functions with no arguments are written to a single file
(suffix .sql, to avoid conflict with a possible directory of the same name):

/public/FUNCTION/myfunc.sql
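The alias substitution could be sketched like this; the helper function and the (partial) alias map are hypothetical, purely to illustrate the proposed naming:

```shell
# Build a per-signature filename using short type aliases, so overloaded
# functions get short deterministic names. Hypothetical helper; only a
# few aliases are mapped here.
to_filename() {
  # $1 is the identity argument list, e.g. "integer, timestamp with time zone"
  printf '%s' "$1" |
    sed -e 's/timestamp with time zone/timestamptz/g' \
        -e 's/character varying/varchar/g' \
        -e 's/integer/int/g' \
        -e 's/, */_/g'
}
echo "public/FUNCTION/myfunc/$(to_filename 'integer, timestamp with time zone').sql"
# -> public/FUNCTION/myfunc/int_timestamptz.sql
```

A real implementation would of course map every built-in type and escape characters that are unsafe in filenames.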

I think about this because in addition to the data types and operators
defined in the development process, there are often small tables that
need to be loaded with content and version controlled as well, rather
like userland-equivalents pg_enum entries.

Is there a term for such tables? I use the term "lookup tables", but perhaps
there is a better one?

In my schema, they typically map status IDs, state IDs,
etc. to human-friendly names.

E.g., if Orders is a huge table of all orders, I might have an
OrderStatuses table to look up all the OrderStatusID columns in Orders:
Orders.OrderStatusID -fk-> OrderStatuses.OrderStatusID
OrderStatuses.Name is unique and contains the human-friendly name of the status.

These small lookup tables also need to be version controlled, of course.

This is a tricky one though, because you might have small tables with
base data, but with references to other huge tables, which you don't
want to include in your automatically version-controlled schema dump.

I solved this problem by creating a quite complex recursive plpgsql
function, resolving all dependencies and joining only the rows from
each table that are required. It allows you to specify a regex matching
a list of tables, which in turn resolves to all tables they have
references to, and dumps those tables too, but only the required rows.
The result is a dump of each such table into a separate file, in a
restorable order that doesn't cause any dependency problems.
Then I have a similar function to do the restoring.

I use this approach to build a restorable clean test database of any
version of the system, be it production or some developer's local version.

And also, not to forget: to make it work, all the sequences need to be
restarted to the same values as in the original database after the
dump is restored.
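The sequence step can be done by generating setval() calls from the source database and replaying them after restore. A hedged sketch, assuming psql connectivity and PostgreSQL 9.1+ for format(); the helper and database names are hypothetical:

```shell
# Emit one "SELECT setval(...)" per sequence in the source database; run
# the output against the freshly restored test database afterwards.
# Hypothetical helper; assumes format() (PostgreSQL 9.1+) and psql access.
dump_sequence_values() {
  src_db=$1
  psql -At "$src_db" -c "
    SELECT format('SELECT setval(%L, (SELECT last_value FROM %s));',
                  c.oid::regclass, c.oid::regclass)
    FROM pg_class c
    WHERE c.relkind = 'S';"
}
# dump_sequence_values gluepay > setvals.sql
# psql restored_test_db -f setvals.sql
```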

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#11)
Re: Schema version management

Joel Jacobson <joel@trustly.com> writes:

If the function's entire identity arguments were included in the filename,
two dumps of the same schema in two different databases
would be guaranteed to produce the same dump.

This would render some very long filenames for functions with many arguments,

Thus, not implausibly, causing the dump to fail entirely on some
filesystems. Case sensitivity, encoding issues, and special characters
in names (eg slashes or backslashes, depending on platform) are
additional pain points. This does not sound like a good plan from here.

Taking a step or two back, it seems to me that the thrust of your
proposal is essentially to throw away all dump ordering information,
which does not seem like a particularly good idea either. It certainly
will not lead to a dump that can be restored reliably. If the use-case
for this is database comparisons, I think we'd be a lot better off to
write a postprocessing tool for regular dumps to perform such
comparisons, rather than whacking pg_dump around to the point where it's
unable to perform its primary function.

regards, tom lane

#13Joel Jacobson
joel@trustly.com
In reply to: Tom Lane (#12)
Re: Schema version management

On Wed, May 23, 2012 at 9:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thus, not implausibly, causing the dump to fail entirely on some
filesystems.  Case sensitivity, encoding issues, and special characters
in names (eg slashes or backslashes, depending on platform) are
additional pain points.  This does not sound like a good plan from here.

This is true, which means some users won't be able to use the feature
because they are using an ancient OS or have function names with slashes.
Hm, is it even possible to have function names with slashes?

The maximum length of table, function, etc. names in postgres is 63 characters.
A function in postgres can have at most 100 arguments.
The absolute majority of users run operating systems allowing filenames of
at least 255 characters: http://en.wikipedia.org/wiki/Comparison_of_file_systems

I suppose you have a lot more experience of what postgres installations
exist in the world. Do you think it's common for databases to have
problematic non-ASCII characters in object names?

Is it a project policy that all features of all standard tools must be
useful for all users on all platforms on all databases? Or is it
acceptable if some features are only usable by, say, 90% of the users?

Taking a step or two back, it seems to me that the thrust of your
proposal is essentially to throw away all dump ordering information,
which does not seem like a particularly good idea either.  It certainly
will not lead to a dump that can be restored reliably.  If the use-case
for this is database comparisons, I think we'd be a lot better off to
write a postprocessing tool for regular dumps to perform such
comparisons, rather than whacking pg_dump around to the point where it's
unable to perform its primary function.

Not at all; the ordering information is not thrown away. It is preserved
in the dump file specified by the -f option, from which each split file
is included using \i.

For example, this is an extract of the -f dump file for my database:

--
-- PostgreSQL database dump
--

SET statement_timeout = 0;
SET client_encoding = 'SQL_ASCII';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

--
-- Name: pgx_diag; Type: SCHEMA; Schema: -; Owner: postgres
--

CREATE SCHEMA pgx_diag;

ALTER SCHEMA pgx_diag OWNER TO postgres;

-- ... some more schemas, languages etc ...
-- ... and then all the included files:

\i /home/postgres/database/gluepay-split/public/TYPE/dblink_pkey_results.sql
\i /home/postgres/database/gluepay-split/public/TYPE/r_matchedwithdrawal.sql
\i /home/postgres/database/gluepay-split/public/TYPE/r_unapprovedwithdrawal.sql
\i /home/postgres/database/gluepay-split/public/TYPE/ukaccountvalidationchecktype.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/check_name.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/describe_entityid.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/get_linkid.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/set_address.sql
-- ... all the objects ..
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerid_fkey.sql
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerstatusid_fkey.sql
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workertypeid_fkey.sql

-- .. and after all the included files comes permissions and stuff:

--
-- Name: public; Type: ACL; Schema: -; Owner: postgres
--

REVOKE ALL ON SCHEMA public FROM PUBLIC;
REVOKE ALL ON SCHEMA public FROM postgres;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO PUBLIC;

#14Joel Jacobson
joel@trustly.com
In reply to: Tom Lane (#12)
Re: Schema version management

On the topic of fixing pg_dump to dump in a predictable order, can
someone please update me on the current state of the problem?

I've read through pg_dump_sort.c and note that objects are first sorted
in a type/name-based ordering, then topologically sorted in a way that
"minimizes unnecessary rearrangement".

How come this doesn't always generate a predictable order? Any ideas on
how to fix the problem? If someone gives me a hint I might make an
effort at implementing the idea.

If pg_dump would dump in a predictable order, it would make sense to
dump all overloaded versions of functions sharing the same name in the
same file.

Then it would be _guaranteed_ that two different databases committing
their schemas to a shared VCS produce exactly the same files when the
schemas are identical, which cannot be guaranteed unless the dump order
is predictable.

Having thought about it, I agree the idea of encoding arguments in
filenames is probably possible, but suboptimal. It would be much better
to write all overloaded functions to the same file and fix the
dump-order predictability problem.

#15Robert Haas
robertmhaas@gmail.com
In reply to: Joel Jacobson (#13)
Re: Schema version management

On Tue, May 22, 2012 at 11:31 PM, Joel Jacobson <joel@trustly.com> wrote:

This is true, which means some users won't be able to use the feature
because they are running an ancient OS or have function names containing
slashes. Hmm, is it even possible to have function names with slashes?

Sure. If you quote the function name, you can put anything you want
in there. Note that Windows disallows a whole bunch of special
characters in filenames, while UNIX-like systems tend to disallow only
slash.

I suppose you have a lot more experience with the Postgres installations
that exist in the world. Do you think it's common for databases to have
problematic non-ASCII characters in object names?

Is it project policy that all features of all standard tools must be
useful for all users on all platforms and all databases? Or is it
acceptable if some features are only usable by, say, 90% of users?

There are cases where we permit features that only work on some
platforms, but it's rare. Usually, we do this only when the platform
lacks some API that exists elsewhere. For example, collations and
prefetching are not supported on Windows because the UNIX APIs we use
don't exist there.

In this case, it seems like you could work around the problem by, say,
URL-escaping any characters that can't be used in an unquoted
identifier. Of course that might make the file name long enough to
hit the platform-specific file name limit. Not sure what to do about
that. The basic idea you're proposing here has been proposed a number
of times before, but it's always fallen down over questions of (1)
what to do with very long object names or those containing special
characters and (2) objects (like functions) for which schema+name is
not a unique identifier.

I don't think either of these problems ought to be a complete
show-stopper. It seems to me that the trade-off is that when object
names are long, contain special characters, or are overloaded, we'll
have to munge the names in some way to prevent collisions. That could
mean that the names are not 100% stable, which would possibly produce
some annoyance if you're using a VCS to track changes, but maybe
that's an acceptable trade-off, because it shouldn't happen very
often. If we could guarantee that identifiers less than 64
characters which are not overloaded and contain no special characters
requiring quoting end up in an eponymous file, I think that would be
good enough to make most of our users pretty happy. In other cases, I
think the point would just be to make it work (with a funny name)
rather than fail.
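
One way the munging Robert describes could look (a hedged sketch, not an
agreed design or pg_dump's actual code): percent-encode anything outside
the characters that are always safe in a file name, and fall back to
truncation plus a short hash of the original name when the result would
exceed the platform limit. The function name and the 255-byte default
are illustrative assumptions.

```python
import hashlib
from urllib.parse import quote

def object_file_name(name, max_len=255):
    # Percent-encode everything outside the unreserved set, so a name
    # like "my/func" becomes "my%2Ffunc" with no risk of collision
    # between distinct object names.
    escaped = quote(name, safe='')
    # If escaping pushes the result past the filesystem limit, truncate
    # and append a short hash of the original name to keep it unique
    # (and stable across dumps).
    if len(escaped) > max_len:
        digest = hashlib.sha1(name.encode('utf-8')).hexdigest()[:8]
        escaped = escaped[:max_len - 9] + '-' + digest
    return escaped
```

Unlike a lossy substitution scheme, this mapping is reversible for
short names, so the object name can always be recovered from the file
name.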

\i /home/postgres/database/gluepay-split/public/TYPE/dblink_pkey_results.sql
\i /home/postgres/database/gluepay-split/public/TYPE/r_matchedwithdrawal.sql
\i /home/postgres/database/gluepay-split/public/TYPE/r_unapprovedwithdrawal.sql
\i /home/postgres/database/gluepay-split/public/TYPE/ukaccountvalidationchecktype.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/check_name.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/describe_entityid.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/get_linkid.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/set_address.sql
-- ... all the objects ..
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerid_fkey.sql
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerstatusid_fkey.sql
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workertypeid_fkey.sql

It would be better to use \ir here rather than hard-code path names, I
think. Then you'd only need to require that all the files be in the
same directory, rather than requiring them to be at a certain
hard-coded place in the filesystem.
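
With \ir, the generated master file might instead look something like
this (a hypothetical sketch of the emitted include lines, reusing the
file names from the example above):

```sql
-- \ir resolves the path relative to the file currently being read,
-- not the current working directory, so the split tree can live in
-- any checkout location.
\ir public/TYPE/dblink_pkey_results.sql
\ir aml/FUNCTION/check_name.sql
```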

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Joel Jacobson
joel@trustly.com
In reply to: Robert Haas (#15)
Re: Schema version management

Robert, thank you for keeping this thread alive.

Hopefully some more will join the discussion.

I'm still hopeful the community can manage to agree upon acceptable
tradeoffs and work-arounds to make this possible.

I think the benefits clearly outweigh the minor issues of filenames,
dumping order, etc.

On Tue, Jun 26, 2012 at 6:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think either of these problems ought to be a complete
show-stopper. It seems to me that the trade-off is that when object
names are long, contain special characters, or are overloaded, we'll
have to munge the names in some way to prevent collisions. That could
mean that the names are not 100% stable, which would possibly produce
some annoyance if you're using a VCS to track changes, but maybe
that's an acceptable trade-off, because it shouldn't happen very
often. If we could guarantee that identifiers less than 64
characters which are not overloaded and contain no special characters
requiring quoting end up in an eponymous file, I think that would be
good enough to make most of our users pretty happy. In other cases, I
think the point would just be to make it work (with a funny name)
rather than fail.

I agree. It's not a problem if the filename is not identical to the name
of the object, as long as the same name generates the same filename on
all architectures. URL escaping would work, but it would be nicer to
convert all non-ASCII characters to ASCII and drop any problematic
characters, or replace them with "_" or some other suitable character.

For the small fraction of users who have managed to find a good reason
to name a function "this/is\a/good.name/of/a\function..", the filename
of such a function would be "this_is_a_good_name_of_a_function".
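
That lossy mapping could be sketched like this (purely illustrative,
not the actual patch): collapse every run of characters that would need
quoting into a single underscore and trim the ends.

```python
import re

def munge_name(name):
    # Collapse any run of characters outside [a-z0-9] into a single
    # '_' and strip leading/trailing underscores. Note this mapping is
    # lossy: distinct object names can collide on the same file name.
    return re.sub(r'[^a-z0-9]+', '_', name.lower()).strip('_')
```

For example, munge_name('this/is\\a/good.name/of/a\\function..')
yields 'this_is_a_good_name_of_a_function', matching the filename
described above.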

As long as the objects are dumped in the same order, there will be no
merge problems when two developers commit changes to the same file. I
think pg_dump already does a reasonable job of making sure the order is
always the same. How big is the problem, really?

It would of course be a little easier to keep track of changes and do
merging if all overloaded functions were kept in separate files, but I
see that as a minor feature request. As long as objects with different
names are kept in separate files, that's good enough for my needs, and I
have _a lot_ of functions, of which quite a few are overloaded.

\i /home/postgres/database/gluepay-split/aml/FUNCTION/get_linkid.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/set_address.sql

It would be better to use \ir here rather than hard-code path names, I
think. Then you'd only need to require that all the files be in the
same directory, rather than requiring them to be at a certain
hard-coded place in the filesystem.

I fully agree!
I didn't know about the \ir feature.

Best regards,

Joel Jacobson

#17Peter Eisentraut
peter_e@gmx.net
In reply to: Joel Jacobson (#16)
Re: Schema version management

On Wed, 2012-06-27 at 10:02 +0200, Joel Jacobson wrote:

Robert, thank you for keeping this thread alive.

Hopefully some more will join the discussion.

I'm still hopeful the community can manage to agree upon acceptable
tradeoffs and work-arounds to make this possible.

I think this idea has merit. Prepare a patch and put it into the next
commit fest.

I think the benefits clearly outweigh the minor issues of filenames,
dumping order, etc.

I see the problem that since the dump order is in general not
deterministic, this will cause random reordering in your master file
that includes all the individual files.

Then again, making the dump order deterministic is a problem that can be
solved (I suppose), so maybe starting there would be a good step. But
it will require a small amount of in-depth pg_dump hacking.

#18Joel Jacobson
joel@trustly.com
In reply to: Peter Eisentraut (#17)
Re: Schema version management

On Tue, Jul 3, 2012 at 7:49 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

I think this idea has merit. Prepare a patch and put it into the next
commit fest.

Glad to hear, I'm on it!

I see the problem that since the dump order is in general not
deterministic, this will cause random reordering in your master file
that includes all the individual files.

Then again, making the dump order deterministic is a problem that can be
solved (I suppose), so maybe starting there would be a good step. But
it will require a small amount of in-depth pg_dump hacking.

I just ran a test where I created objects in different orders and
compared the dumps. It appears pg_dump emits objects in alphabetical
order. This works fine for most objects, but not for overloaded
functions, which are dumped in OID order.

Are there any cases other than overloaded functions where the dump
order isn't deterministic?

While waiting for your reply, I'll be working on fixing the problem with
overloaded functions.
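
The idea behind such a fix can be sketched as follows (hypothetical
Python illustration, not pg_dump's actual C code): break ties between
overloaded functions on their argument type signature instead of their
OID, so creation order no longer matters.

```python
# Each entry is (schema, function name, argument type signature).
# The OIDs these objects got at creation time are irrelevant here.
funcs = [
    ('public', 'check_name', ('text', 'integer')),
    ('public', 'check_name', ('integer',)),
    ('aml', 'set_address', ('text',)),
]

# Sorting on the full (schema, name, argtypes) key is deterministic:
# two databases with the same schema always dump in the same order.
ordered = sorted(funcs)
```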

#19Robert Haas
robertmhaas@gmail.com
In reply to: Joel Jacobson (#18)
Re: Schema version management

On Wed, Jul 4, 2012 at 9:02 AM, Joel Jacobson <joel@trustly.com> wrote:

On Tue, Jul 3, 2012 at 7:49 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

I think this idea has merit. Prepare a patch and put it into the next
commit fest.

Glad to hear, I'm on it!

I see the problem that since the dump order is in general not
deterministic, this will cause random reordering in your master file
that includes all the individual files.

Then again, making the dump order deterministic is a problem that can be
solved (I suppose), so maybe starting there would be a good step. But
it will require a small amount of in-depth pg_dump hacking.

I just ran a test where I created objects in different orders and
compared the dumps. It appears pg_dump emits objects in alphabetical
order. This works fine for most objects, but not for overloaded
functions, which are dumped in OID order.

Are there any cases other than overloaded functions where the dump
order isn't deterministic?

While waiting for your reply, I'll be working on fixing the problem with
overloaded functions.

My vote is: when there's an overloaded function, put each version in
its own file, and name the files something like
functionname_something.sql. And just document that "something" may not
be entirely stable.
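
One candidate for the "something" (purely illustrative; the naming
scheme was never settled in this thread) is the argument type list:

```python
def overload_file_name(func, argtypes):
    # check_name(text, integer) -> check_name_text_integer.sql
    # The suffix is stable as long as the signature doesn't change,
    # though renaming an argument type would still rename the file.
    if not argtypes:
        return func + '.sql'
    return func + '_' + '_'.join(argtypes) + '.sql'
```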

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Joel Jacobson
joel@trustly.com
In reply to: Robert Haas (#19)
Re: Schema version management

On Thu, Jul 5, 2012 at 2:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:

My vote is - when there's an overloaded function, put each version in
its own file. And name the files something like
functionname_something.sql. And just document that something may not
be entirely stable.

I would agree that's better if the dump order isn't deterministic.

However, it looks like an easy fix to make the dump order deterministic:
http://archives.postgresql.org/pgsql-hackers/2012-07/msg00232.php

If the dump order is deterministic, I think it's cleaner to put all versions
in the same file.

Benefits:
+ Pretty-looking filenames
+ Same file structure for all object types, with no special exception
for functions
