WIP Patch: pg_dump structured

Started by Attila Sokiabout 3 years ago4 messageshackers

pgsql@attilasoki.com

about 3 years ago

Hi all,

I was looking for a way to track actual schema changes after database migrations
in a VCS. Preferably, the schema definition should come from a trusted source
like pg_dump and should consist of small files.
This patch was born out of that need.

This patch adds the structured output format to pg_dump.
This format is a plaintext output split up into multiple files and the
resulting small files are stored in a directory path based on the dumped object.
This format can be restored by feeding its plaintext toc file (restore-dump.sql)
to psql. The output is also suitable for manipulating the files with standard
editing tools.

This patch is a WIP (V1). The patch is against master and it compiles
successfully on macOS 13.2.1 aarch64 and on Debian 11 arm64.
To test, execute pg_dump --format=structured --file=/path/to/outputdir dbname

What do you think of this feature, any chance it will be added to pg_dump once
the patch is ready?
Is the chosen name "structured" appropriate?

Thanks for any feedback.

--
Attila Soki

Tom Lane

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Attila Soki (#1)

Re: WIP Patch: pg_dump structured

Attila Soki <pgsql@attilasoki.com> writes:

This patch adds the structured output format to pg_dump.
This format is a plaintext output split up into multiple files and the
resulting small files are stored in a directory path based on the dumped object.

Won't this fail completely with SQL objects whose names aren't suitable
to be pathname components? "A/B" is a perfectly good name so far as
SQL is concerned. You could also have problems with collisions on
case-insensitive filesystems.

This format can be restored by feeding its plaintext toc file (restore-dump.sql)
to psql. The output is also suitable for manipulating the files with standard
editing tools.

This seems a little contradictory: if you want to edit the individual
files, you'd have to also update restore-dump.sql, or else it's pointless.
It might make more sense to consider this as a write-only dump format
and not worry about whether it can be restored directly.

What do you think of this feature, any chance it will be added to pg_dump once
the patch is ready?

I'm not clear on how big the use-case is. It's not really obvious to
me that this'd have any benefit over the existing plain-text dump
capability. You can edit those files too, at least till the schema
gets too big for your editor. (But if you've got many many thousand
SQL objects, a file-per-SQL-object directory will also be no fun to
deal with.)

regards, tom lane

Attila Soki

pgsql@attilasoki.com

about 3 years ago

In reply to: Tom Lane (#2)

Re: WIP Patch: pg_dump structured

On 12 Mar 2023, at 21:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Attila Soki <pgsql@attilasoki.com> writes:

This patch adds the structured output format to pg_dump.
This format is a plaintext output split up into multiple files and the
resulting small files are stored in a directory path based on the dumped object.

Won't this fail completely with SQL objects whose names aren't suitable
to be pathname components? "A/B" is a perfectly good name so far as
SQL is concerned. You could also have problems with collisions on
case-insensitive filesystems.

The “A/B” case is handled in _CleanFilename function, the slash and other
problematic characters are replaced.
You are right about the case-insensivity, this is not handled and will fail. I forgot
to handle that. I trying to find a way to handle this.

This format can be restored by feeding its plaintext toc file (restore-dump.sql)
to psql. The output is also suitable for manipulating the files with standard
editing tools.

This seems a little contradictory: if you want to edit the individual
files, you'd have to also update restore-dump.sql, or else it's pointless.
It might make more sense to consider this as a write-only dump format
and not worry about whether it can be restored directly.

The main motivation was to track changes with VCS at the file (object) level,
editing small files was intended as a second possible use case.
I did not know that a write-only format would go.

What do you think of this feature, any chance it will be added to pg_dump once
the patch is ready?

I'm not clear on how big the use-case is. It's not really obvious to
me that this'd have any benefit over the existing plain-text dump
capability. You can edit those files too, at least till the schema
gets too big for your editor. (But if you've got many many thousand
SQL objects, a file-per-SQL-object directory will also be no fun to
deal with.)

I use something like this (a previous version) to track several thousand
objects. But I'm not sure if that would have a wide user base.
Therefore the wip to see if there is interest in this feature.
I think the advantage of having many small files is that it is recognizable
which file (object) is involved in a commit and that the SQL functions and
tables get a change history.

Thank you for your feedback.

Regards,
Attila Soki

Attila Soki

pgsql@attilasoki.com

about 3 years ago

In reply to: Attila Soki (#3)

Re: WIP Patch: pg_dump structured

On 12 Mar 2023, at 22:56, Attila Soki <pgsql@attilasoki.com> wrote:

On 12 Mar 2023, at 21:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Won't this fail completely with SQL objects whose names aren't suitable
to be pathname components? "A/B" is a perfectly good name so far as
SQL is concerned. You could also have problems with collisions on
case-insensitive filesystems.

You are right about the case-insensivity, this is not handled and will fail. I forgot
to handle that. I trying to find a way to handle this.

Hi Tom,

Thank you for your feedback.

This is an updated version of the pg_dump structured wip patch (V2) with the
following changes:
- to avoid path collisions on case insensitive filessystems, all path components
created from user input are suffixed with the hex representation of a 32 bit
hash. “A/B” and “a/b” will get different suffixes.
- all path components are now filesystem safe

All this is a proposal, if you know a better solution please let me know.

This patch is a WIP (V2). The patch is against master and it compiles
successfully on macOS 13.2.1 aarch64 and on Debian 11 arm64.
To test, execute pg_dump --format=structured --file=/path/to/outputdir dbname

This format can be restored by feeding its plaintext toc file (restore-dump.sql)
to psql. The output is also suitable for manipulating the files with standard
editing tools.

This seems a little contradictory: if you want to edit the individual
files, you'd have to also update restore-dump.sql, or else it's pointless.
It might make more sense to consider this as a write-only dump format
and not worry about whether it can be restored directly.

The main motivation was to track changes with VCS at the file (object) level,
editing small files was intended as a second possible use case.
I did not know that a write-only format would go.

Declaring this format as a write-only dump would allow a more flexible directory
structure since we wouldn't have to maintain the restore order.

What do you think of this feature, any chance it will be added to pg_dump once
the patch is ready?

I'm not clear on how big the use-case is. It's not really obvious to
me that this'd have any benefit over the existing plain-text dump
capability. You can edit those files too, at least till the schema
gets too big for your editor. (But if you've got many many thousand
SQL objects, a file-per-SQL-object directory will also be no fun to
deal with.)

Here is a sample use case to demonstrate how this format could be used to track
schema changes with git. The main difference from using the existing plain-text
schema dump is, that this format makes it possible to keep a history of the
actual changes made to the individual objects. For example, to determine which
migrations have changed the foo function.

# import the schema into the repository
cd /path/to/my_app_code
pg_dump --format=structured --schema-only --file=foo_schema foodb
git add foo_schema --all
git commit foo_schema -m'initial commit foo_schema'

# make changes in the db
(my_app migrate foodb)
(psql foodb < tweak.sql)

# get a fresh dump
rm -rf foo_schema
pg_dump --format=structured --schema-only --file=foo_schema foodb

# now inspect the changes under foo_schema: there may be changed, new and
# missing files
git status foo_schema

# commit all schema changes
git add foo_schema -u
git commit foo_schema -m'changes from migration foodb'

# later, inspect changes
git log --stat

# show the history of one object
git log -p -- "foo_schema/path/to/FUNCTIONS/foo.sql"

Sure, the user base for this is narrow.

Thanks for any feedback.

—
Best regards
Attila Soki

WIP Patch: pg_dump structured

Attachments:

Attachments: