pg_dump -Fd must create directory

Started by François Beausoleil · over 13 years ago · 3 messages · general
#1François Beausoleil
francois@teksol.info

Hi all!

Why must pg_dump create a fresh new directory every time? I'm running some tests where I dump a database to a directory, run git init and git add --all, then dump again. When I did that after making some modifications (specifically, creating a new table and adding a few hundred thousand records), git status told me some files had been renamed and others removed. I worked around this by dumping to a new directory, then moving .git manually.

My real-world use case is I have a largish database (46 GiB gzip'd dump) that I dump every few hours, and some tables are essentially static. I was thinking of saving some network traffic by transferring only the delta.
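The delta idea only pays off if unchanged files keep stable names between dumps, which is exactly what's in question here. A minimal local sketch of the intended scheme, with plain files standing in for dump segments (no PostgreSQL involved; all paths and contents are made up):

```shell
# 'src' stands in for a fresh dump directory, 'dst' for the mirror on
# the backup host. Only files that differ (or are new) get copied.
mkdir -p src dst
printf 'static table\n' > src/a.dat
printf 'static table\n' > dst/a.dat   # already mirrored: should be skipped
printf 'new table\n'    > src/b.dat   # new since last dump: should transfer
for f in src/*; do
  name=$(basename "$f")
  if ! cmp -s "$f" "dst/$name"; then
    cp "$f" "dst/$name"
    echo "transferred $name"
  fi
done
```

This is the file-level version of what rsync does; it breaks down if a table's data file gets a different name on the next dump, since the "unchanged" file then looks brand new.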

Any thoughts on this? Is this something that can change, or could be made to change? Where do the IDs used in the dump come from? Can they be made stable?

Thanks!
François Beausoleil

$ cat a.sh
PGDATABASE=${USER}
rm -rf thedump thedump2
psql -c "select * into a from generate_series(1, 100000, 1) as t1(a)"
pg_dump -Fd --file=thedump
cd thedump && git init && git add --all . && git commit -m "initial" && cd ..
psql -c "select * into b from generate_series(1, 100000, 1) as t1(b)"
pg_dump -Fd --file=thedump2
mv thedump/.git thedump2/
cd thedump2
git status --short

$ sh a.sh
SELECT 100000
Initialized empty Git repository in /root/tmp/thedump/.git/
[master (root-commit) 9f8dc9f] initial
2 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 1882.dat.gz
create mode 100644 toc.dat
SELECT 100000
D 1882.dat.gz
M toc.dat
?? 1886.dat.gz
?? 1887.dat.gz

#2David Salisbury
salisbury@globe.gov
In reply to: François Beausoleil (#1)
Re: pg_dump -Fd must create directory

It looks to me like you're misusing git.

You should only git init once, and always use that directory.
Then run pg_dump, which should create one file per database
with the file name you've specified.
I'm not sure of the flags, but I'd recommend the plain text format.
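The workflow described above might look something like this sketch. The pg_dump line is illustrative only (it needs a running server), so a stand-in file is written in its place; the repository and database names are made up:

```shell
# One repo, one stable filename; each re-dump overwrites in place,
# so git sees an ordinary modification instead of renames/deletes.
mkdir -p dumps && cd dumps
git init -q
# pg_dump -Fp --file=mydb.sql mydb            # the real dump (needs a server)
printf -- '-- dump, version 1\n' > mydb.sql   # stand-in for the dump output
git add mydb.sql
git -c user.name=me -c user.email=me@example.com commit -qm "initial dump"
# pg_dump -Fp --file=mydb.sql mydb            # next dump, same stable path
printf -- '-- dump, version 2\n' > mydb.sql   # stand-in again
git status --short                            # a plain modification, no renames
```

With a stable filename, git (and rsync) can diff the old and new contents directly.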

I'm also unsure what you mean by network traffic, as you don't
mention a remote repository, but there are nice visual tools
for you to see the changes to files between your committed
objects. Running git init again will more than likely lose all
changes to files.

-ds

#3François Beausoleil
francois@teksol.info
In reply to: David Salisbury (#2)
Re: pg_dump -Fd must create directory

Le 2012-09-13 à 16:51, David Salisbury a écrit :

It looks to me like you're misusing git.

You should only git init once, and always use that directory.
Then run pg_dump, which should create one file per database
with the file name you've specified.
I'm not sure of the flags, but I'd recommend the plain text format.

I'm also unsure what you mean by network traffic, as you don't
mention a remote repository, but there are nice visual tools
for you to see the changes to files between your committed
objects. Running git init again will more than likely lose all
changes to files.

I was just running a test: looking at a way to transfer large amounts of data for backup purposes with a tool that's especially suited for deltas. I know about rsync, but this was a thought experiment. I was only surprised by pg_dump's restriction that it must create a new directory every time. I was looking for a rationale.

Also, git init is a safe operation: within an existing repository, git init reports that it reinitialized, but does not lose files. I haven't tried it with local changes or a dirty index.
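A quick check of that claim, in a throwaway directory (the name is arbitrary):

```shell
# Re-running git init inside an existing repository reinitializes it
# but leaves the index and working tree untouched.
mkdir -p reinit-test && cd reinit-test
git init -q
printf 'data\n' > f.txt
git add f.txt
git init            # prints "Reinitialized existing Git repository ..."
git status --short  # f.txt is still staged
```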

Finally, when NOT using the plain text format, pg_restore can restore more than one table at a time, using the --jobs flag. On a multi-core, multi-spindle machine, this can cut down the restore time tremendously.
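For the record, that might look like the following (illustrative only, since it needs a running server; the database names are made up):

```shell
# Directory format writes one .dat file per table, which is what
# lets pg_restore work on several tables at once.
pg_dump -Fd --file=thedump mydb
createdb mydb_restored
pg_restore --jobs=4 -d mydb_restored thedump   # up to 4 parallel workers
```

The --jobs flag also works with the custom (-Fc) format, but not with plain text.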

Bye,
François