pg_dump & blobs - editable dump?
I have been speaking to Pavel about pg_dump support of blobs, and he thinks
it is important to allow for some kind of human-readable version of the
dump to be created.
My guess is that this will involve a plain text schema dump, followed by
all BLOBs in separate files, and a script to load them. To implement this
I'll obviously need to be passed a directory/file location for the script,
since I can't pipe separate files to stdout.
I'd be interested in knowing what features people think are important in
this kind of format: what do you need to do with the blob files, what do
people want to edit, etc.
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/
Why not use something like tar, with the first file stored in ASCII?
That way, you could easily extract the human-readable SQL but still pipe
the blobs to stdout.
Peter
--
Peter Mount
Enterprise Support
Maidstone Borough Council
Any views stated are my own, and not those of Maidstone Borough Council
Philip Warner wrote:
My guess is that this will involve a plain text schema dump, followed by
all BLOBs in separate files, and a script to load them. To implement this
I'll obviously need to be passed a directory/file location for the script,
since I can't pipe separate files to stdout.
uuencode the blobs, perhaps, using a shar-like format?
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
At 14:58 12/07/00 +0100, Peter Mount wrote:
Why not use something like tar, with the first file stored in ASCII?
That way, you could easily extract the human-readable SQL but still pipe
the blobs to stdout.
Has Tom Lane paid you to send this message? :-}
If anyone can send me a nice interface for reading and writing a tar file
from C, I'll do it. I just don't have the inclination to learn about tar
internals at the moment. By 'nice' I mean that I would like:
- to be able to create the archive and write files sequentially using
something similar to fopen/fwrite/fclose.
- open an archive and examine and read files sequentially using a similar
interface to opendir/readdir/fopen/fread/fclose.
- Ideally, open a specified file in the archive by name; but if that's not
possible, then it should be easy using the 'opendir' function above.
This would be a very useful library, I am sure. It also needs to be
licensable under BSD to go into the PG distribution.
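A rough sketch of the sort of interface being asked for; every name here is
hypothetical, just to pin down the shape (this is not an existing library):

/* Hypothetical tar-archive API sketch; all names are illustrative. */
#include <stddef.h>

typedef struct TarArchive TarArchive;   /* opaque archive handle */
typedef struct TarFile    TarFile;      /* opaque member handle  */

/* Writing: create an archive, then append members sequentially,
 * in the style of fopen/fwrite/fclose. */
TarArchive *tarOpenWrite(const char *path);
TarFile    *tarCreateMember(TarArchive *a, const char *name);
size_t      tarWrite(TarFile *f, const void *buf, size_t len);
int         tarCloseMember(TarFile *f);     /* pads to a 512-byte block */
int         tarCloseArchive(TarArchive *a);

/* Reading: scan members sequentially, in the style of
 * opendir/readdir/fopen/fread/fclose. */
TarArchive *tarOpenRead(const char *path);
const char *tarNextMember(TarArchive *a);   /* NULL at end of archive */
TarFile    *tarOpenMember(TarArchive *a, const char *name); /* open by name */
size_t      tarRead(TarFile *f, void *buf, size_t len);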
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/
At 10:13 12/07/00 -0400, Lamar Owen wrote:
Philip Warner wrote:
My guess is that this will involve a plain text schema dump, followed by
all BLOBs in separate files, and a script to load them. To implement this
I'll obviously need to be passed a directory/file location for the script,
since I can't pipe separate files to stdout.

uuencode the blobs, perhaps, using a shar-like format?
For the human readable version, the request was to make it editable and
sendable to psql. As a result the BLOBs need to be in their binary format
OR psql needs to support BLOB import from stdin. As a first pass I was
hoping for the simple 'dump them into files' solution.
What I am confused by is what people actually want to do with a load of
BLOBs sitting in a directory; if there are specific needs, then I'd also
like to cater for them in the custom file formats.
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/
On Thu, Jul 13, 2000 at 12:17:28AM +1000, Philip Warner wrote:
At 14:58 12/07/00 +0100, Peter Mount wrote:
Why not use something like tar, with the first file stored in ASCII?
That way, you could easily extract the human-readable SQL but still pipe
the blobs to stdout.

If anyone can send me a nice interface for reading and writing a tar file
from C, I'll do it. I just don't have the inclination to learn about tar
internals at the moment. By 'nice' I mean that I would like:
I suspect you might find a library of either tar or cpio read functions as
part of the FreeBSD sysinstall utility.
--
[ Jim Mercer jim@reptiles.org +1 416 410-5633 ]
[ Reptilian Research -- Longer Life through Colder Blood ]
[ Don't be fooled by cheap Finnish imitations; BSD is the One True Code. ]
No he didn't; I've just been sort of lurking on this subject ;-)
Actually, tar files are simply a small header, followed by the file's
contents. To add another file, you simply write another header and contents
(which is why you can cat two tar files together and get a working file).
http://www.goice.co.jp/member/mo/formats/tar.html has a nice brief
description of the header.
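For reference, that layout expressed as a C struct. This is the POSIX ustar
variant, with offsets matching the description at the URL above; pre-POSIX
tar simply treats everything from offset 345 onward as padding:

/* POSIX (ustar) tar header: one 512-byte block preceding each member. */
struct tar_header {
    char name[100];      /* member name, '\0' terminated        */
    char mode[8];        /* permissions, octal ascii            */
    char uid[8];         /* owner user id, octal ascii          */
    char gid[8];         /* owner group id, octal ascii         */
    char size[12];       /* member size in bytes, octal ascii   */
    char mtime[12];      /* modify time, seconds since 1970     */
    char chksum[8];      /* simple byte sum of the header       */
    char typeflag;       /* '0' regular file, '2' symlink, ...  */
    char linkname[100];  /* link target, for link entries       */
    char magic[6];       /* "ustar" followed by '\0'            */
    char version[2];     /* "00"                                */
    char uname[32];      /* owner user name                     */
    char gname[32];      /* owner group name                    */
    char devmajor[8];    /* major device id, octal ascii        */
    char devminor[8];    /* minor device id, octal ascii        */
    char prefix[155];    /* path prefix for long names          */
    char pad[12];        /* padding out to 512 bytes            */
};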
As for a C API with a compatible licence, if needs must I'll write one to
your spec (maidast should be back online in a couple of days, so I'll be
back in business, development-wise).
Peter
--
Peter Mount
Enterprise Support
Maidstone Borough Council
Any views stated are my own, and not those of Maidstone Borough Council
Which is why having them on stdout is still a nice option to have. You can
pipe the lot through your favourite compressor (gzip, bzip2 etc) and
straight on to tape, or whatever.
I don't know why you would want them as separate files - just think what
would happen to directory search times!!
How about this as an idea:
* Option to dump sql to stdout and blobs to a designated file
* option to dump sql & blobs to stdout
* option to dump just sql to stdout
* option to dump just blobs to stdout
That way (depending on the database design), you could handle the sql &
blobs separately but still have everything backed up.
PS: Backups are foremost on my mind at the moment - had an NT one blow up
in my face on Monday and it wasn't nice :-(
Peter
--
Peter Mount
Enterprise Support
Maidstone Borough Council
Any views stated are my own, and not those of Maidstone Borough Council
Philip Warner wrote:
At 10:13 12/07/00 -0400, Lamar Owen wrote:
Philip Warner wrote:
I'll obviously need to be passed a directory/file location for the script,
since I can't pipe separate files to stdout.

uuencode the blobs, perhaps, using a shar-like format?

For the human readable version, the request was to make it editable and
sendable to psql. As a result the BLOBs need to be in their binary format
OR psql needs to support BLOB import from stdin. As a first pass I was
hoping for the simple 'dump them into files' solution.
If in a shell archive format, shouldn't it be easy enough for pg_restore
to be made to do the stdin-to-blob thing (through whatever mechanisms
you're already using to get the blob back in in the first place,
combined with some steering/deshar-ing/uudecoding logic)? The backup
could even be made 'self-extracting' as shars usually are... :-) Of
course, you then have to be on the watch for the usual shar trojans...
If we simply know that the backup cannot be sent to psql, but a
deshar-ed version can have the schema sent to psql, would that
ameliorate most concerns?
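For reference, a minimal sketch of classic uuencoding, the text armour a
shar-style dump would need for binary blobs. It follows the common
convention of writing a zero 6-bit group as '`' rather than a space:

#include <stdio.h>

/* Sketch: emit one uuencoded line for up to 45 raw bytes. Each 6-bit
 * value v becomes the printable character v + 0x20, with 0 written
 * as '`' (0x60) as most implementations do. */
static void
uu_line(FILE *out, const unsigned char *in, int n)
{
    int i, j;

    fputc(n ? n + 0x20 : 0x60, out);        /* length byte for this line */
    for (i = 0; i < n; i += 3) {
        unsigned char a = in[i];
        unsigned char b = (i + 1 < n) ? in[i + 1] : 0;
        unsigned char c = (i + 2 < n) ? in[i + 2] : 0;
        int v[4];

        v[0] = a >> 2;                      /* top 6 bits of a    */
        v[1] = ((a & 0x03) << 4) | (b >> 4);
        v[2] = ((b & 0x0f) << 2) | (c >> 6);
        v[3] = c & 0x3f;                    /* bottom 6 bits of c */
        for (j = 0; j < 4; j++)
            fputc(v[j] ? v[j] + 0x20 : 0x60, out);
    }
    fputc('\n', out);
}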
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
At 15:32 12/07/00 +0100, Peter Mount wrote:
Which is why having them on stdout is still a nice option to have. You can
pipe the lot through your favourite compressor (gzip, bzip2 etc) and
straight on to tape, or whatever.
Well, the custom format does that; it also does compression and can go to
stdout.
I don't know why you would want them as separate files - just think what
would happen to directory search times!!
I agree; the request was based on a desire to do something like pg_dump_lo,
which puts them all in a directory, I think.
How about this as an idea:
* Option to dump sql to stdout and blobs to a designated file
* option to dump sql & blobs to stdout
* option to dump just sql to stdout
* option to dump just blobs to stdout
The sql is *tiny* compared to most BLOB contents. The new pg_dump currently
supports:
* schema, table data, & blobs
* schema, table data
* schema
* table data & blobs
* table data
BLOBs without table data are not recommended, since the process of relinking
the BLOBs to the tables is *only* performed on tables that are restored.
This is to allow import of BLOBs & tables into existing DBs. As a result
your fourth option is not really an option. The other three are already
covered.
Any single-file format (tar would be one of those) can be sent to stdout,
and BLOBs are not supported in plain-text output (for obvious reasons).
That way (depending on the database design), you could handle the sql &
blobs separately but still have everything backed up.
Unfortunately the data and BLOBS need to go together.
PS: Backups are foremost on my mind at the moment - had an NT one blow up
in my face on Monday and it wasn't nice :-(
With the current version you should be able to do:

    pg_dump -Fc --blobs > /dev/myfavoritetapedrive

to back up the entire database, with compressed data, to tape. And:

    cat /dev/mt | pg_restore --db=dbname

to restore the entire db into the specified database. Or:

    pg_dump -Fc --blobs | pg_restore --db=dbname

to copy a database with blobs...
So, in summary, I think most of what you want is already there. It's just
the human-readable part that's a problem.
*Please* let me know if there is some issue I have not considered...
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/
If anyone can send me a nice interface for reading and writing a tar file
from C, I'll do it. I just don't have the inclination to learn about tar
internals at the moment. By 'nice' I mean that I would like:
I don't know the details of the API, but the NetBSD pax code handles
tar formats (and others) nicely and on a cursory glance seems to have
what you need. Of course, the license is acceptable. If you want the
source, let me know.
Cheers,
Brook
At 15:25 12/07/00 +0100, Peter Mount wrote:
No he didn't; I've just been sort of lurking on this subject ;-)
Actually, tar files are simply a small header, followed by the file's
contents. To add another file, you simply write another header and contents
(which is why you can cat two tar files together and get a working file).
http://www.goice.co.jp/member/mo/formats/tar.html has a nice brief
description of the header.
Damn! I knew someone would call my bluff.
As you say, it looks remarkably simple.
A couple of questions:
  136  12 bytes   Modify time (in octal ascii)

...do you know the format of the date (seconds since 1970?).

  157  100 bytes  Linkname ('\0' terminated, 99 maximum length)

...what's this? Is it the target for symlinks?

  329  8 bytes    Major device ID (in octal ascii)
  337  8 bytes    Minor device ID (in octal ascii)
  345  167 bytes  Padding

...and what should I set these to?
As for a C api with a compatible licence, if needs must I'll write one to
your spec (maidast should be back online in a couple of days, so I'll be
back in business development wise).
If you're serious about the offer, I'd be happy. But, given how simple the
format is, I can probably tack it into place myself.
There is a minor problem. Currently I compress the output stream as I
receive it from PG and send it to the archive, so I don't know how big it
will be until it is written. The custom output format can handle this, but
when streaming a tar file to tape I have to know the file size first. This
means writing to /tmp. I suppose that's OK, but I've been trying to avoid it.
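A minimal sketch of that workaround, assuming a hypothetical
write_tar_header() that needs the size up front; tmpfile() at least avoids
naming an explicit /tmp path:

#include <stdio.h>

/* Hypothetical: emits a 512-byte header carrying name and size. */
extern void write_tar_header(FILE *archive, const char *name, long size);

/* Sketch: spool a stream of unknown length to an anonymous temp file,
 * learn its size, then copy it into the archive behind a header. */
static int
archive_member(FILE *archive, FILE *src, const char *name)
{
    FILE  *tmp = tmpfile();         /* unlinked automatically on close */
    char   buf[8192];
    size_t n;
    long   size;

    if (tmp == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, src)) > 0)
        fwrite(buf, 1, n, tmp);     /* size still unknown here */
    size = ftell(tmp);              /* now we know it */

    write_tar_header(archive, name, size);

    rewind(tmp);
    while ((n = fread(buf, 1, sizeof buf, tmp)) > 0)
        fwrite(buf, 1, n, archive);
    fclose(tmp);
    return 0;
}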
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/
At 10:38 12/07/00 -0400, Lamar Owen wrote:
If we simply know that the backup cannot be sent to psql, but a
deshar-ed version can have the schema sent to psql, would that
ameliorate most concerns?
In the current version,

    pg_restore --schema

will send the schema to stdout.
Is that sufficient? Or are you strictly interested in the text output side
of things?
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/
Philip Warner wrote:
In the current version, pg_restore --schema will send the schema to stdout.
Is that sufficient? Or are you strictly interested in the text output side
of things?
Strictly interested in the text output side of things, for various
not-necessarily-good reasons (:-)).
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
http://www.goice.co.jp/member/mo/formats/tar.html has a nice brief
description of the header.

Best is to look at one of the actual standards; the tar and cpio formats
are in the pax specification.
136 12 bytes Modify time (in octal ascii)
...do you know the format of the date (seconds since 1970?).
It's just 11 bytes plus \0 in tar's usual encode-this-as-octal format:

static void
encode_octal(unsigned char *p, size_t n, unsigned long value)
{
    const unsigned char octal[] = "01234567";

    /* Fill the n-byte field right to left with octal digits;
     * the caller supplies the trailing '\0'. */
    while (n) {
        *(p + --n) = octal[value & 07];
        value >>= 3;
    }
}
Warning: some values allowed by tar exceed the size of 'long' on a 32-bit
platform.
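For example, a sketch of filling the 12-byte mtime field at offset 136 with
the function above (the caller supplies the trailing '\0', per the '11 bytes
plus \0' convention):

#include <time.h>

/* Sketch: set the mtime field of a 512-byte header block. */
static void
set_mtime(unsigned char hdr[512])
{
    encode_octal(hdr + 136, 11, (unsigned long) time(NULL));
    hdr[136 + 11] = '\0';       /* 11 octal digits plus trailing '\0' */
}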
  157  100 bytes  Linkname ('\0' terminated, 99 maximum length)

...what's this? Is it the target for symlinks?

Yes, it is the target for link entries. Long pathnames, separately, get
split into two pieces on a '/' as I recall. The code I offered you
previously has code to do this too; I appreciate that the code is quite
likely not what you want, but you might consider looking at it or other
tar/pax code to help you interpret the standard.
  329  8 bytes    Major device ID (in octal ascii)
  337  8 bytes    Minor device ID (in octal ascii)
  345  167 bytes  Padding

...and what should I set these to?
Zero.
If you're serious about the offer, I'd be happy. But, given how simple the
format is, I can probably tack it into place myself.
For the very limited formats you want to create, that's probably
the easiest way. You don't care about unpacking, GNU v. POSIX format,
device files, etc etc.
There is a minor problem. Currently I compress the output stream as I
receive it from PG and send it to the archive, so I don't know how big it
will be until it is written. The custom output format can handle this, but
when streaming a tar file to tape I have to know the file size first. This
means writing to /tmp. I suppose that's OK, but I've been trying to
avoid it.
I recommend you compress the whole stream, not the pieces. Presumably
you can determine the size of the pieces you're backing up, and ending
with a .tar.gz (or whatever) file is more convenient to manage than a
.tar file of compressed pieces unless you really expect people to be
extracting individual files from the backup very often.
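A minimal sketch of that whole-stream approach; zlib is an assumption here,
though a natural one given that the custom format already does compression:

#include <stdio.h>
#include <zlib.h>

/* Sketch: push the whole tar stream through one gzip stream, giving a
 * backup.tar.gz rather than a tar archive of compressed members. */
static int
copy_to_gzip(FILE *in, const char *path)
{
    gzFile out = gzopen(path, "wb");
    unsigned char buf[8192];
    size_t n;

    if (out == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        if (gzwrite(out, buf, (unsigned) n) != (int) n) {
            gzclose(out);
            return -1;
        }
    return gzclose(out) == Z_OK ? 0 : -1;
}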
Having to pass everything through /tmp would be really unfortunate.
Regards,
Giles
[This is a multi-reply, CCs and -general removed.]
From: Peter Mount <petermount@it.maidstone.gov.uk>
Date: Wed, 12 Jul 2000 14:58:50 +0100
Hi,
Why not use something like tar, with the first file stored in ASCII?
some filesystems do not allow you to have files bigger than 2 GB :-( so I
do not think that one big file (even a gzipped tar file) is good.
From: Peter Mount <petermount@it.maidstone.gov.uk>
Date: Wed, 12 Jul 2000 15:32:10 +0100
I don't know why you would want them as separate files - just think what
would happen to directory search times!!
No problem; you can use one index file with hashes in it, so files are then
stored as:

    AA/AA/AA/00
    AA/AA/AA/01

See how squid (http://www.squid-proxy.org/) does its job here. No problem,
I think. I really prefer this solution over one big file. You can easily
swap files with other databases, and you can even create an md5sum of the
md5sums of each file, giving you a multi-md5sum of your database (so you
can be really sure that your backup is OK, etc. :-).
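A minimal sketch of that kind of hashed layout; the fan-out and the hash
here are illustrative only, not squid's actual scheme:

#include <stdio.h>

/* Sketch: map a blob id to a two-level hashed path, for example
 * "12/BF/0012BF2A" for id 0x0012BF2A, keeping each directory small. */
static void
blob_path(char *buf, size_t buflen, unsigned long blob_id)
{
    unsigned int d1 = (blob_id >> 16) & 0xFF;   /* first-level dir  */
    unsigned int d2 = (blob_id >> 8)  & 0xFF;   /* second-level dir */

    snprintf(buf, buflen, "%02X/%02X/%08lX", d1, d2, blob_id);
}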
That way (depending on the database design), you could handle the sql
& blobs separately but still have everything backed up.

PS: Backups are foremost on my mind at the moment - had an NT one blow
up in my face on Monday and it wasn't nice :-(
No one (I hope) is arguing about the need for backing up BLOBs from the DB :-)
--
Pavel Janík ml.
Pavel.Janik@linux.cz
At 07:58 13/07/00 +1000, Giles Lean wrote:
I recommend you compress the whole stream, not the pieces. Presumably
you can determine the size of the pieces you're backing up, and ending
with a .tar.gz (or whatever) file is more convenient to manage than a
.tar file of compressed pieces unless you really expect people to be
extracting individual files from the backup very often.

Having to pass everything through /tmp would be really unfortunate.
The only things I compress are the table data and the blobs (i.e. the big
things); unfortunately, the table data is of unknown uncompressed size. I
*could* do two 'COPY TO STDOUT' calls, just to get the size, but that seems
like a very bad idea.
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/