pg_dump directory archive format / parallel pg_dump

Started by Joachim Wieland about 15 years ago (27 messages)
#1 Joachim Wieland
joe@mcknight.de
3 attachment(s)

Here's a new series of patches for the parallel dump/restore. They need to be
applied on top of each other.

The parallel pg_dump patch does not yet use the synchronized snapshot
functionality from my other patch, so as not to create more dependencies than
necessary.

(1) pg_dump directory archive format (without checks as requested by Heikki)
(2) parallel pg_dump
(3) checks for the directory archive format

Joachim

Attachments:

pg_dump-directory.diff.gz (application/x-gzip)
pg_dump-directory-parallel.diff.gz (application/x-gzip)
pg_dump-directory-parallel-checks.diff.gz (application/x-gzip)
#2 Jaime Casanova
jaime@2ndquadrant.com
In reply to: Joachim Wieland (#1)
Re: pg_dump directory archive format / parallel pg_dump

On Fri, Jan 7, 2011 at 3:18 PM, Joachim Wieland <joe@mcknight.de> wrote:

Here's a new series of patches for the parallel dump/restore. They need to be
applied on top of each other.

Is this the latest version of this patch? If so, the commitfest app
should be updated to reflect that

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training

#3 Joachim Wieland
joe@mcknight.de
In reply to: Jaime Casanova (#2)
3 attachment(s)
Re: pg_dump directory archive format / parallel pg_dump

On Mon, Jan 17, 2011 at 5:38 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

Is this the latest version of this patch? If so, the commitfest app
should be updated to reflect that

Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.

Joachim

Attachments:

pg_dump-directory.diff.gz (application/x-gzip)
pg_dump-directory-parallel.diff.gz (application/x-gzip)
pg_dump-directory-parallel-checks.diff.gz (application/x-gzip)
#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#3)
Re: pg_dump directory archive format / parallel pg_dump

On 19.01.2011 07:45, Joachim Wieland wrote:

On Mon, Jan 17, 2011 at 5:38 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

Is this the latest version of this patch? If so, the commitfest app
should be updated to reflect that

Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.

What's the idea of storing the file sizes in the toc file? It looks like
it's not used for anything.

It would be nice to have this format match the tar format. At the
moment, there's a couple of cosmetic differences:

* TOC file is called "TOC", instead of "toc.dat"

* blobs TOC file is called "BLOBS.TOC" instead of "blobs.toc"

* each blob is stored as "blobs/<oid>.dat", instead of "blob_<oid>.dat"

The only significant difference is that in the directory archive format,
each data file has a header in the beginning.

What are the benefits of the data file header? Would it be better to
leave it out, so that the format would be identical to the tar format?
You could then just tar up the directory to get a tar archive, or vice
versa.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5 Joachim Wieland
joe@mcknight.de
In reply to: Heikki Linnakangas (#4)
Re: pg_dump directory archive format / parallel pg_dump

On Wed, Jan 19, 2011 at 7:47 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.

What's the idea of storing the file sizes in the toc file? It looks like
it's not used for anything.

It's part of the overall idea to make sure files are not inadvertently
exchanged between different backups and that a file is not truncated.
In the future I'd also like to add a checksum to the TOC so that a
backup can be checked for integrity. This will cost performance but
with the parallel backup it can be distributed to several processors.

It would be nice to have this format match the tar format. At the moment,
there's a couple of cosmetic differences:

* TOC file is called "TOC", instead of "toc.dat"

* blobs TOC file is called "BLOBS.TOC" instead of "blobs.toc"

* each blob is stored as "blobs/<oid>.dat", instead of "blob_<oid>.dat"

That can be done easily...

The only significant difference is that in the directory archive format,
each data file has a header in the beginning.

What are the benefits of the data file header? Would it be better to leave
it out, so that the format would be identical to the tar format? You could
then just tar up the directory to get a tar archive, or vice versa.

The header is there to identify a file; it contains the same header that
every other pg_dump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.

If you want to drop the header altogether, fine with me but if it's
just for the tar <-> directory conversion, then I am failing to see
what the use case of that would be.

A tar archive has the advantage that you can postprocess the dump data
with other tools but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...

Joachim

#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#5)
Re: pg_dump directory archive format / parallel pg_dump

On 19.01.2011 16:01, Joachim Wieland wrote:

On Wed, Jan 19, 2011 at 7:47 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.

What's the idea of storing the file sizes in the toc file? It looks like
it's not used for anything.

It's part of the overall idea to make sure files are not inadvertently
exchanged between different backups and that a file is not truncated.
In the future I'd also like to add a checksum to the TOC so that a
backup can be checked for integrity. This will cost performance but
with the parallel backup it can be distributed to several processors.

Ok. I'm going to leave out the filesize. I can see some value in that,
and the CRC, but I don't want to add stuff that's not used at this point.

It would be nice to have this format match the tar format. At the moment,
there's a couple of cosmetic differences:

* TOC file is called "TOC", instead of "toc.dat"

* blobs TOC file is called "BLOBS.TOC" instead of "blobs.toc"

* each blob is stored as "blobs/<oid>.dat", instead of "blob_<oid>.dat"

That can be done easily...

The only significant difference is that in the directory archive format,
each data file has a header in the beginning.

What are the benefits of the data file header? Would it be better to leave
it out, so that the format would be identical to the tar format? You could
then just tar up the directory to get a tar archive, or vice versa.

The header is there to identify a file; it contains the same header that
every other pg_dump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.

Hmm, tar format doesn't support compression, but looks like the file
format issue has been thought of already: there's still code there to
add .gz suffix for compressed files. How about adopting that convention
in the directory format too? That would make an uncompressed directory
format compatible with the tar format.

That seems pretty attractive anyway, because you can then dump to a
directory, and manually gzip the data files later.

Now that we have an API for compression in compress_io.c, it probably
wouldn't be very hard to implement the missing compression support to
tar format either.

If you want to drop the header altogether, fine with me but if it's
just for the tar<-> directory conversion, then I am failing to see
what the use case of that would be.

A tar archive has the advantage that you can postprocess the dump data
with other tools but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...

I don't know how popular it'll be in practice, but it seems very nice to
me if you can do things like parallel pg_dump in directory format first,
and then tar it up to a file for archival.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7 Joachim Wieland
joe@mcknight.de
In reply to: Heikki Linnakangas (#6)
Re: pg_dump directory archive format / parallel pg_dump

On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

It's part of the overall idea to make sure files are not inadvertently
exchanged between different backups and that a file is not truncated.
In the future I'd also like to add a checksum to the TOC so that a
backup can be checked for integrity. This will cost performance but
with the parallel backup it can be distributed to several processors.

Ok. I'm going to leave out the filesize. I can see some value in that, and
the CRC, but I don't want to add stuff that's not used at this point.

Okay.

The header is there to identify a file; it contains the same header that
every other pg_dump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.

Hmm, tar format doesn't support compression, but looks like the file format
issue has been thought of already: there's still code there to add .gz
suffix for compressed files. How about adopting that convention in the
directory format too? That would make an uncompressed directory format
compatible with the tar format.

So what you could do is dump in the tar format, untar and restore in
the directory format. I see that this sounds nice but still I am not
sure why someone would dump to the tar format in the first place.

But you still cannot go back from the directory archive to the tar
archive because the standard command line tar will not respect the
order of the objects that pg_restore expects in a tar format, right?

That seems pretty attractive anyway, because you can then dump to a
directory, and manually gzip the data files later.

The command line gzip will probably add its own header to the file
that pg_restore would need to strip off...

This is a valid use case for people who are concerned with a fast
dump: usually they would dump uncompressed and compress the
archive later. However, once we have parallel pg_dump, this advantage
vanishes.

Now that we have an API for compression in compress_io.c, it probably
wouldn't be very hard to implement the missing compression support to tar
format either.

True, but the question about the advantage of the tar format remains :-)

A tar archive has the advantage that you can postprocess the dump data
with other tools  but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...

I don't know how popular it'll be in practice, but it seems very nice to me
if you can do things like parallel pg_dump in directory format first, and
then tar it up to a file for archival.

Yes, but you cannot pg_restore the archive then if it was created with
standard tar, right?

Joachim

#8 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#7)
Re: pg_dump directory archive format / parallel pg_dump

On 20.01.2011 15:46, Joachim Wieland wrote:

On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

The header is there to identify a file; it contains the same header that
every other pg_dump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.

Hmm, tar format doesn't support compression, but looks like the file format
issue has been thought of already: there's still code there to add .gz
suffix for compressed files. How about adopting that convention in the
directory format too? That would make an uncompressed directory format
compatible with the tar format.

So what you could do is dump in the tar format, untar and restore in
the directory format. I see that this sounds nice but still I am not
sure why someone would dump to the tar format in the first place.

I'm not sure either. Maybe you want to pipe the output of "pg_dump -F t"
via an ssh tunnel to another host, where you untar it, producing a
directory format dump. You can then edit the directory format dump, and
restore it back to the database without having to tar it again.

It gives you a lot of flexibility if the formats are compatible, which
is generally good.
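
For illustration, that round trip could look roughly like this (a sketch only;
"mydb", "otherhost" and "dumpdir" are hypothetical names, and it assumes the
directory-format reader from this patch on the receiving side):

  pg_dump -Ft mydb | ssh otherhost 'mkdir dumpdir && tar -xf - -C dumpdir'
  # ... edit the files under dumpdir on otherhost ...
  pg_restore -d mydb dumpdir     # run on otherhost; reads the directory dump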

But you still cannot go back from the directory archive to the tar
archive because the standard command line tar will not respect the
order of the objects that pg_restore expects in a tar format, right?

Hmm, I didn't realize pg_restore requires the files to be in a certain
order in the tar file. There's no mention of that in the docs either; we
should add that. It doesn't actually require that if you read from a
file, but from stdin it does.

You can put files in the archive in a certain order if you list them
explicitly in the tar command line, like "tar cf backup.tar toc.dat
...". It's hard to know the right order, though. In practice you would
need to do "tar tf backup.tar >files" before untarring, and use "files"
to tar them again in the right order.
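
A rough sketch of that workflow ("backup.tar" and "files" are just example
names; it assumes GNU tar, whose -T/--files-from option adds members in the
listed order):

  tar tf backup.tar > files      # record the member order before extracting
  tar xf backup.tar              # unpack into the directory-format layout
  # ... edit the extracted files ...
  tar cf backup2.tar -T files    # re-create the archive in the same order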

That seems pretty attractive anyway, because you can then dump to a
directory, and manually gzip the data files later.

The command line gzip will probably add its own header to the file
that pg_restore would need to strip off...

Yeah, we should write the header too. That's not hard, e.g. gzopen will
do that automatically, or you can pass a flag to deflateInit2.

A tar archive has the advantage that you can postprocess the dump data
with other tools but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...

I don't know how popular it'll be in practice, but it seems very nice to me
if you can do things like parallel pg_dump in directory format first, and
then tar it up to a file for archival.

Yes, but you cannot pg_restore the archive then if it was created with
standard tar, right?

See above, you can unless you try to pipe it to pg_restore. In fact,
that's listed as an advantage of the tar format over other formats in
the pg_dump documentation.
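
For example, something along these lines should work (hypothetical names;
assumes an uncompressed directory dump, i.e. -Z0, so the files match the tar
format):

  pg_dump -Fd -Z0 -f dumpdir mydb        # uncompressed directory dump
  (cd dumpdir && tar cf ../mydb.tar *)   # archive it with standard tar
  pg_restore -d newdb mydb.tar           # fine: pg_restore can seek in a file
  pg_restore -d newdb < mydb.tar         # may fail: stdin needs toc.dat first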

(I'm working on this, no need to submit a new patch)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#9 Florian Pflug
fgp@phlo.org
In reply to: Heikki Linnakangas (#8)
Re: pg_dump directory archive format / parallel pg_dump

On Jan20, 2011, at 16:22 , Heikki Linnakangas wrote:

You can put files in the archive in a certain order if you list them explicitly in the tar command line, like "tar cf backup.tar toc.dat ...". It's hard to know the right order, though. In practice you would need to do "tar tf backup.tar >files" before untarring, and use "files" to tar them again in the right order.

Hm, could we create a file in the backup directory which lists the files in the right order?

best regards,
Florian Pflug

#10 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#8)
1 attachment(s)
Re: pg_dump directory archive format / parallel pg_dump

On 20.01.2011 17:22, Heikki Linnakangas wrote:

(I'm working on this, no need to submit a new patch)

Ok, here's a heavily refactored version of this (also available at
git://git.postgresql.org/git/users/heikki/postgres.git, branch
pg_dump_directory). The directory format is now identical to the tar
format, except that in the directory format the files can be compressed.
Also we don't write the restore.sql file - it would be nice to have, but
pg_restore doesn't require it. We can leave that as a TODO.

I ended up writing another compression abstraction layer in
compress_io.c. It wraps fopen / gzopen etc. in a common API, so that the
caller doesn't need to care if the file is compressed or not. In
hindsight, the compression API we put in earlier didn't suit us very
well. But I guess it wasn't a complete waste, as it moved the gory
details of zlib out of the custom format code.

If compression is used, the files are created with the .gz suffix, and
include the gzip header so that you can manipulate them easily with
gzip/gunzip utilities. When reading, we accept files with or without the
.gz suffix, and you can have some files compressed and others uncompressed.
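
So a workflow roughly like this should be possible (hypothetical names; -Z0
gives an uncompressed dump to start from):

  pg_dump -Fd -Z0 -f dumpdir mydb     # fast, uncompressed directory dump
  gzip dumpdir/[0-9]*.dat             # compress the per-table data files later
  pg_restore -d newdb dumpdir         # compressed and plain files can be mixed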

I haven't updated the documentation yet.

There's one UI thing that bothers me. The option to specify the target
directory is called --file. But it's clearly not a file. OTOH, I'd hate
to introduce a parallel --dir option just for this. Any thoughts on this?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

pg_dump_directory-2.patch (text/x-diff)
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index de4968c..5266cc8 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -194,8 +194,11 @@ PostgreSQL documentation
       <term><option>--file=<replaceable class="parameter">file</replaceable></option></term>
       <listitem>
        <para>
-        Send output to the specified file.  If this is omitted, the
-        standard output is used.
+        Send output to the specified file. This parameter can be omitted for file
+        based output formats, in which case the standard output is used. It must
+        be given for the directory output format however, where it specifies the target
+        directory instead of a file. In this case the directory is created by
+        <command>pg_dump</command> and must not exist before.
        </para>
       </listitem>
      </varlistentry>
@@ -226,9 +229,24 @@ PostgreSQL documentation
           <para>
            Output a custom-format archive suitable for input into
            <application>pg_restore</application>.
-           This is the most flexible output format in that it allows manual
-           selection and reordering of archived items during restore.
-           This format is also compressed by default.
+           Together with the directory output format, this is the most flexible
+           output format in that it allows manual selection and reordering of
+           archived items during restore. This format is also compressed by
+           default.
+          </para>
+         </listitem>
+        </varlistentry>
+
+        <varlistentry>
+         <term><literal>d</></term>
+         <term><literal>directory</></term>
+         <listitem>
+          <para>
+           Output a directory-format archive suitable for input into
+           <application>pg_restore</application>. This will create a directory
+           instead of a file and this directory will contain one file for each
+           table and BLOB of the database that is being dumped. This format is
+           compressed by default.
           </para>
          </listitem>
         </varlistentry>
@@ -947,6 +965,14 @@ CREATE DATABASE foo WITH TEMPLATE template0;
   </para>
 
   <para>
+   To dump a database into a directory-format archive:
+
+<screen>
+<prompt>$</prompt> <userinput>pg_dump -Fd mydb -f dumpdir</userinput>
+</screen>
+  </para>
+
+  <para>
    To reload an archive file into a (freshly created) database named
    <literal>newdb</>:
 
diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index db607b4..8410af1 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -20,7 +20,7 @@ override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
 
 OBJS=	pg_backup_archiver.o pg_backup_db.o pg_backup_custom.o \
 	pg_backup_files.o pg_backup_null.o pg_backup_tar.o \
-	dumputils.o compress_io.o $(WIN32RES)
+	pg_backup_directory.o dumputils.o compress_io.o $(WIN32RES)
 
 KEYWRDOBJS = keywords.o kwlookup.o
 
diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index 8c41a69..506533a 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -7,6 +7,17 @@
  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
+ * This file includes two APIs for dealing with compressed data. The first
+ * provides more flexibility, using callbacks to read/write data from the
+ * underlying stream. The second API is a wrapper around fopen/gzopen and
+ * friends, providing an interface similar to those, but abstracts away
+ * the compression (or not). Both APIs use libz for the compression, but
+ * the second API uses gzip headers, so the resulting files can be easily
+ * manipulated with the gzip utility.
+ *
+ * Compressor API
+ * --------------
+ *
  *  The interface for writing to an archive consists of three functions:
  *  AllocateCompressor, WriteDataToArchive and EndCompressor. First you call
  *  AllocateCompressor, then write all the data by calling WriteDataToArchive
@@ -23,6 +34,17 @@
  *
  *  The interface is the same for compressed and uncompressed streams.
  *
+ * Compressed stream API
+ * ----------------------
+ *
+ *  The compressed stream API is a wrapper around to the C standard fopen()
+ *  and libz's gzopen() API. It allows you to use the same functions for
+ *  compressed and uncompressed streams. cfopen_read() first tries to open
+ *  the file with given name, and if it fails, it tries to open the same
+ *  file with the .gz suffix. cfopen_write() opens a file for writing, an
+ *  extra argument specifies if the file should be compressed, and adds the
+ *  .gz suffix to the filename if it is. This allows you to easily handle both
+ *  compressed and uncompressed files.
  *
  * IDENTIFICATION
  *     src/bin/pg_dump/compress_io.c
@@ -33,6 +55,11 @@
 #include "compress_io.h"
 
 
+/*----------------------
+ * Compressor API
+ *----------------------
+ */
+
 /* typedef appears in compress_io.h */
 struct CompressorState
 {
@@ -48,6 +75,10 @@ struct CompressorState
 
 static const char *modulename = gettext_noop("compress_io");
 
+#ifdef HAVE_LIBZ
+static int	hasSuffix(const char *filename, const char *suffix);
+#endif
+
 static void ParseCompressionOption(int compression, CompressionAlgorithm *alg,
 								   int *level);
 
@@ -418,3 +449,245 @@ WriteDataToArchiveNone(ArchiveHandle *AH, CompressorState *cs,
 }
 
 
+/*----------------------
+ * Compressed stream API
+ *----------------------
+ */
+
+/*
+ * cfp represents an open stream, wrapping the underlying FILE or gzFile
+ * pointer. This is opaque to the callers.
+ */
+struct cfp
+{
+	FILE *uncompressedfp;
+#ifdef HAVE_LIBZ
+	gzFile compressedfp;
+#endif
+};
+
+/*
+ * Return the raw FILE pointer associated with a stream. The stream must be
+ * uncompressed.
+ */
+FILE *
+cfgetfp(cfp *fp)
+{
+	if (fp->uncompressedfp == NULL)
+	{
+		*((int *) (NULL)) = 1;
+		die_horribly(NULL, modulename, "cannot get plain FILE * from a compressed stream\n");
+	}
+	return fp->uncompressedfp;
+}
+
+/*
+ * Open a file for reading. 'path' is the file to open, and 'mode' should
+ * be either "r" or "rb".
+ *
+ * If the file at 'path' does not exist, we append the ".gz" suffix (if 'path'
+ * doesn't already have it) and try again. So if you pass "foo" as 'path',
+ * this will open either "foo" or "foo.gz".
+ */
+cfp *
+cfopen_read(const char *path, const char *mode)
+{
+	cfp *fp;
+
+#ifdef HAVE_LIBZ
+	if (hasSuffix(path, ".gz"))
+		fp = cfopen(path, mode, 1);
+	else
+#endif
+	{
+		fp = cfopen(path, mode, 0);
+#ifdef HAVE_LIBZ
+		if (fp == NULL)
+		{
+			int fnamelen = strlen(path) + 4;
+			char *fname = malloc(fnamelen);
+			if (fname == NULL)
+				die_horribly(NULL, modulename, "Out of memory\n");
+
+			snprintf(fname, fnamelen, "%s%s", path, ".gz");
+			fp = cfopen(fname, mode, 1);
+			free(fname);
+		}
+#endif
+	}
+	return fp;
+}
+
+/*
+ * Open a file for writing. 'path' indicates the path name, and 'mode' must
+ * be a filemode as accepted by fopen() and gzopen() that indicates writing
+ * ("w", "wb", "a", or "ab").
+ *
+ * If 'compression' is non-zero, a gzip compressed stream is opened, and
+ * and 'compression' indicates the compression level used. The ".gz" suffix
+ * is automatically added to 'path' in that case.
+ */
+cfp *
+cfopen_write(const char *path, const char *mode, int compression)
+{
+	cfp *fp;
+
+	if (compression == 0)
+		fp = cfopen(path, mode, 0);
+	else
+	{
+#ifdef HAVE_LIBZ
+		int fnamelen = strlen(path) + 4;
+		char *fname = malloc(fnamelen);
+		if (fname == NULL)
+			die_horribly(NULL, modulename, "Out of memory\n");
+
+		snprintf(fname, fnamelen, "%s%s", path, ".gz");
+		fp = cfopen(fname, mode, 1);
+		free(fname);
+#else
+		die_horribly(NULL, modulename, "not built with zlib support\n");
+#endif
+	}
+	return fp;
+}
+
+/*
+ * Opens file 'path' in 'mode'. If 'compression' is non-zero, the file
+ * is opened with libz gzopen(), otherwise with plain fopen()
+ */
+cfp *
+cfopen(const char *path, const char *mode, int compression)
+{
+	cfp *fp = malloc(sizeof(cfp));
+	if (fp == NULL)
+		die_horribly(NULL, modulename, "Out of memory\n");
+
+	if (compression != 0)
+	{
+#ifdef HAVE_LIBZ
+		fp->compressedfp = gzopen(path, mode);
+		fp->uncompressedfp = NULL;
+		if (fp->compressedfp == NULL)
+		{
+			free(fp);
+			fp = NULL;
+		}
+#else
+		die_horribly(NULL, modulename, "not built with zlib support\n");
+#endif
+	}
+	else
+	{
+#ifdef HAVE_LIBZ
+		fp->compressedfp = NULL;
+#endif
+		fp->uncompressedfp = fopen(path, mode);
+		if (fp->uncompressedfp == NULL)
+		{
+			free(fp);
+			fp = NULL;
+		}
+	}
+
+	return fp;
+}
+
+
+int
+cfread(void *ptr, int size, cfp *fp)
+{
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+		return gzread(fp->compressedfp, ptr, size);
+	else
+#endif
+		return fread(ptr, 1, size, fp->uncompressedfp);
+}
+
+int
+cfwrite(const void *ptr, int size, cfp *fp)
+{
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+		return gzwrite(fp->compressedfp, ptr, size);
+	else
+#endif
+		return fwrite(ptr, 1, size, fp->uncompressedfp);
+}
+
+int
+cfgetc(cfp *fp)
+{
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+		return gzgetc(fp->compressedfp);
+	else
+#endif
+		return fgetc(fp->uncompressedfp);
+}
+
+char *
+cfgets(cfp *fp, char *buf, int len)
+{
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+		return gzgets(fp->compressedfp, buf, len);
+	else
+#endif
+		return fgets(buf, len, fp->uncompressedfp);
+}
+
+int
+cfclose(cfp *fp)
+{
+	int result;
+
+	if (fp == NULL)
+	{
+		errno = EBADF;
+		return EOF;
+	}
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+	{
+		result = gzclose(fp->compressedfp);
+		fp->compressedfp = NULL;
+	}
+	else
+#endif
+	{
+		result = fclose(fp->uncompressedfp);
+		fp->uncompressedfp = NULL;
+	}
+	free(fp);
+
+	return result;
+}
+
+int
+cfeof(cfp *fp)
+{
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+		return gzeof(fp->compressedfp);
+	else
+#endif
+		return feof(fp->uncompressedfp);
+}
+
+#ifdef HAVE_LIBZ
+static int
+hasSuffix(const char *filename, const char *suffix)
+{
+	int filenamelen = strlen(filename);
+	int suffixlen = strlen(suffix);
+
+	if (filenamelen < suffixlen)
+		return 0;
+
+	return memcmp(&filename[filenamelen - suffixlen],
+					suffix,
+					suffixlen) == 0;
+}
+#endif
diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h
index 13e536f..934c04a 100644
--- a/src/bin/pg_dump/compress_io.h
+++ b/src/bin/pg_dump/compress_io.h
@@ -54,4 +54,18 @@ extern size_t WriteDataToArchive(ArchiveHandle *AH, CompressorState *cs,
 								 const void *data, size_t dLen);
 extern void EndCompressor(ArchiveHandle *AH, CompressorState *cs);
 
+
+typedef struct cfp cfp;
+
+extern FILE *cfgetfp(cfp *fp);
+extern cfp *cfopen(const char *path, const char *mode, int compression);
+extern cfp *cfopen_read(const char *path, const char *mode);
+extern cfp *cfopen_write(const char *path, const char *mode, int compression);
+extern int cfread(void *ptr, int size, cfp *fp);
+extern int cfwrite(const void *ptr, int size, cfp *fp);
+extern int cfgetc(cfp *fp);
+extern char *cfgets(cfp *fp, char *buf, int len);
+extern int cfclose(cfp *fp);
+extern int cfeof(cfp *fp);
+
 #endif
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 8fa9a57..b5803bd 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -48,9 +48,10 @@ typedef enum _archiveFormat
 {
 	archUnknown = 0,
 	archCustom = 1,
-	archFiles = 2,
-	archTar = 3,
-	archNull = 4
+	archDirectory = 2,
+	archFiles = 3,
+	archTar = 4,
+	archNull = 5
 } ArchiveFormat;
 
 typedef enum _archiveMode
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 64d8d93..bacfc4e 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -25,6 +25,7 @@
 
 #include <ctype.h>
 #include <unistd.h>
+#include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 
@@ -1722,6 +1723,8 @@ _discoverArchiveFormat(ArchiveHandle *AH)
 	char		sig[6];			/* More than enough */
 	size_t		cnt;
 	int			wantClose = 0;
+	char		buf[MAXPGPATH];
+	struct stat	st;
 
 #if 0
 	write_msg(modulename, "attempting to ascertain archive format\n");
@@ -1738,10 +1741,37 @@ _discoverArchiveFormat(ArchiveHandle *AH)
 	if (AH->fSpec)
 	{
 		wantClose = 1;
-		fh = fopen(AH->fSpec, PG_BINARY_R);
-		if (!fh)
-			die_horribly(AH, modulename, "could not open input file \"%s\": %s\n",
-						 AH->fSpec, strerror(errno));
+		/*
+		 * Check if the specified archive is a directory. If so, we open its
+		 * TOC file.
+		 */
+		buf[0] = '\0';
+		if (stat(AH->fSpec, &st) == 0 && S_ISDIR(st.st_mode))
+		{
+			if (snprintf(buf, MAXPGPATH, "%s/%s", AH->fSpec, "toc.dat") >= MAXPGPATH)
+				die_horribly(AH, modulename, "directory name too long: \"%s\"\n",
+							 AH->fSpec);
+			fh = fopen(buf, PG_BINARY_R);
+			if (!fh)
+			{
+#ifdef HAVE_LIBZ
+				/* the archive format accepts a gzipped toc.dat as well */
+				if (snprintf(buf, MAXPGPATH, "%s/%s", AH->fSpec, "toc.dat.gz") >= MAXPGPATH)
+					die_horribly(AH, modulename, "directory name too long: \"%s\"\n",
+								 AH->fSpec);
+				fh = fopen(buf, PG_BINARY_R);
+#endif
+				if (!fh)
+					die_horribly(AH, modulename, "input directory does not appear to be a valid archive\n");
+			}
+		}
+		else
+		{
+			fh = fopen(AH->fSpec, PG_BINARY_R);
+			if (!fh)
+				die_horribly(AH, modulename, "could not open input file \"%s\": %s\n",
+							 AH->fSpec, strerror(errno));
+		}
 	}
 	else
 	{
@@ -1951,6 +1981,10 @@ _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 			InitArchiveFmt_Custom(AH);
 			break;
 
+		case archDirectory:
+			InitArchiveFmt_Directory(AH);
+			break;
+
 		case archFiles:
 			InitArchiveFmt_Files(AH);
 			break;
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index d9378df..4b6d9f1 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -374,6 +374,7 @@ extern void EndRestoreBlob(ArchiveHandle *AH, Oid oid);
 extern void EndRestoreBlobs(ArchiveHandle *AH);
 
 extern void InitArchiveFmt_Custom(ArchiveHandle *AH);
+extern void InitArchiveFmt_Directory(ArchiveHandle *AH);
 extern void InitArchiveFmt_Files(ArchiveHandle *AH);
 extern void InitArchiveFmt_Null(ArchiveHandle *AH);
 extern void InitArchiveFmt_Tar(ArchiveHandle *AH);
diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c
new file mode 100644
index 0000000..9565411
--- /dev/null
+++ b/src/bin/pg_dump/pg_backup_directory.c
@@ -0,0 +1,672 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_backup_directory.c
+ *
+ *  A directory format dump is a directory, which contains a "toc.dat" file
+ *  for the TOC, and a separate file for each data entry, named "<oid>.dat".
+ *  Large objects (BLOBs) are stored in separate files named "blob_<uid>.dat",
+ *  and there's a plain-text TOC file for them called "blobs.toc". If
+ *  compression is used, each data file is individually compressed and the
+ *  ".gz" suffix is added to the filenames. The TOC files, however, are not
+ *  compressed.
+ *
+ *  NOTE: This format is identical to the files written in the tar file in
+ *  the 'tar' format, except that we don't write the restore.sql file, and
+ *  the tar format doesn't support compression. Please keep the formats in
+ *  sync.
+ *
+ *
+ *  Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ *  Portions Copyright (c) 1994, Regents of the University of California
+ *  Portions Copyright (c) 2000, Philip Warner
+ *
+ *  Rights are granted to use this software in any way so long
+ *  as this notice is not removed.
+ *
+ *  The author is not responsible for loss or damages that may
+ *  result from it's use.
+ *
+ * IDENTIFICATION
+ *     src/bin/pg_dump/pg_backup_directory.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <dirent.h>
+#include <sys/stat.h>
+
+#include "pg_backup_archiver.h"
+#include "compress_io.h"
+
+typedef struct
+{
+	/*
+	 * Our archive location. This is basically what the user specified as his
+	 * backup file but of course here it is a directory.
+	 */
+	char			   *directory;
+
+	cfp				   *dataFH;				/* currently open data file */
+
+	cfp				   *blobsTocFH;			/* file handle for blobs.toc */
+} lclContext;
+
+typedef struct
+{
+	char	   *filename;		/* filename excluding the directory (basename) */
+} lclTocEntry;
+
+static const char *modulename = gettext_noop("directory archiver");
+
+/* prototypes for private functions */
+static void _ArchiveEntry(ArchiveHandle *AH, TocEntry *te);
+static void _StartData(ArchiveHandle *AH, TocEntry *te);
+static void _EndData(ArchiveHandle *AH, TocEntry *te);
+static size_t _WriteData(ArchiveHandle *AH, const void *data, size_t dLen);
+static int	_WriteByte(ArchiveHandle *AH, const int i);
+static int	_ReadByte(ArchiveHandle *);
+static size_t _WriteBuf(ArchiveHandle *AH, const void *buf, size_t len);
+static size_t _ReadBuf(ArchiveHandle *AH, void *buf, size_t len);
+static void _CloseArchive(ArchiveHandle *AH);
+static void _PrintTocData(ArchiveHandle *AH, TocEntry *te, RestoreOptions *ropt);
+
+static void _WriteExtraToc(ArchiveHandle *AH, TocEntry *te);
+static void _ReadExtraToc(ArchiveHandle *AH, TocEntry *te);
+static void _PrintExtraToc(ArchiveHandle *AH, TocEntry *te);
+
+static void _StartBlobs(ArchiveHandle *AH, TocEntry *te);
+static void _StartBlob(ArchiveHandle *AH, TocEntry *te, Oid oid);
+static void _EndBlob(ArchiveHandle *AH, TocEntry *te, Oid oid);
+static void _EndBlobs(ArchiveHandle *AH, TocEntry *te);
+static void _LoadBlobs(ArchiveHandle *AH, RestoreOptions *ropt);
+
+static char *prependDirectory(ArchiveHandle *AH, const char *relativeFilename);
+
+static void createDirectory(const char *dir);
+
+
+/*
+ *	Init routine required by ALL formats. This is a global routine
+ *	and should be declared in pg_backup_archiver.h
+ *
+ *	Its task is to create any extra archive context (using AH->formatData),
+ *	and to initialize the supported function pointers.
+ *
+ *	It should also prepare whatever its input source is for reading/writing,
+ *	and in the case of a read mode connection, it should load the Header & TOC.
+ */
+void
+InitArchiveFmt_Directory(ArchiveHandle *AH)
+{
+	lclContext *ctx;
+
+	/* Assuming static functions, this can be copied for each format. */
+	AH->ArchiveEntryPtr = _ArchiveEntry;
+	AH->StartDataPtr = _StartData;
+	AH->WriteDataPtr = _WriteData;
+	AH->EndDataPtr = _EndData;
+	AH->WriteBytePtr = _WriteByte;
+	AH->ReadBytePtr = _ReadByte;
+	AH->WriteBufPtr = _WriteBuf;
+	AH->ReadBufPtr = _ReadBuf;
+	AH->ClosePtr = _CloseArchive;
+	AH->ReopenPtr = NULL;
+	AH->PrintTocDataPtr = _PrintTocData;
+	AH->ReadExtraTocPtr = _ReadExtraToc;
+	AH->WriteExtraTocPtr = _WriteExtraToc;
+	AH->PrintExtraTocPtr = _PrintExtraToc;
+
+	AH->StartBlobsPtr = _StartBlobs;
+	AH->StartBlobPtr = _StartBlob;
+	AH->EndBlobPtr = _EndBlob;
+	AH->EndBlobsPtr = _EndBlobs;
+
+	AH->ClonePtr = NULL;
+	AH->DeClonePtr = NULL;
+
+	/*
+	 * Set up some special context used in compressing data.
+	 */
+	ctx = (lclContext *) calloc(1, sizeof(lclContext));
+	if (ctx == NULL)
+		die_horribly(AH, modulename, "out of memory\n");
+	AH->formatData = (void *) ctx;
+
+	ctx->dataFH = NULL;
+	ctx->blobsTocFH = NULL;
+
+	/* Initialize LO buffering */
+	AH->lo_buf_size = LOBBUFSIZE;
+	AH->lo_buf = (void *) malloc(LOBBUFSIZE);
+	if (AH->lo_buf == NULL)
+		die_horribly(AH, modulename, "out of memory\n");
+
+	/*
+	 * Now open the TOC file
+	 */
+
+	if (!AH->fSpec || strcmp(AH->fSpec, "") == 0)
+		die_horribly(AH, modulename, "no directory specified\n");
+
+	ctx->directory = AH->fSpec;
+
+	if (AH->mode == archModeWrite)
+	{
+		/* Create the directory, errors are caught there */
+		createDirectory(ctx->directory);
+	}
+	else
+	{							/* Read Mode */
+		char	   *fname;
+		cfp		   *tocFH;
+
+		fname = prependDirectory(AH, "toc.dat");
+
+		tocFH = cfopen_read(fname, PG_BINARY_R);
+		if (tocFH == NULL)
+			die_horribly(AH, modulename,
+						 "could not open input file \"%s\": %s\n",
+						 fname, strerror(errno));
+
+		ctx->dataFH = tocFH;
+		ReadHead(AH);
+		ReadToc(AH);
+
+		/* Nothing else in the file, so close it again... */
+		if (cfclose(tocFH) != 0)
+			die_horribly(AH, modulename, "could not close TOC file: %s\n",
+						 strerror(errno));
+		ctx->dataFH = NULL;
+	}
+}
+
+/*
+ * Called by the Archiver when the dumper creates a new TOC entry.
+ *
+ * We determine the filename for this entry.
+*/
+static void
+_ArchiveEntry(ArchiveHandle *AH, TocEntry *te)
+{
+	lclTocEntry	   *tctx;
+	char			fn[MAXPGPATH];
+
+	tctx = (lclTocEntry *) calloc(1, sizeof(lclTocEntry));
+	if (!tctx)
+		die_horribly(AH, modulename, "out of memory\n");
+	if (te->dataDumper)
+	{
+		snprintf(fn, MAXPGPATH, "%d.dat", te->dumpId);
+		tctx->filename = strdup(fn);
+	}
+	else if (strcmp(te->desc, "BLOBS") == 0)
+		tctx->filename = strdup("blobs.toc");
+	else
+		tctx->filename = NULL;
+
+	te->formatData = (void *) tctx;
+}
+
+/*
+ * Called by the Archiver to save any extra format-related TOC entry
+ * data.
+ *
+ * Use the Archiver routines to write data - they are non-endian, and
+ * maintain other important file information.
+ */
+static void
+_WriteExtraToc(ArchiveHandle *AH, TocEntry *te)
+{
+	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
+
+	/*
+	 * A dumpable object has set tctx->filename, any other object has not.
+	 * (see _ArchiveEntry).
+	 */
+	if (tctx->filename)
+		WriteStr(AH, tctx->filename);
+	else
+		WriteStr(AH, "");
+}
+
+/*
+ * Called by the Archiver to read any extra format-related TOC data.
+ *
+ * Needs to match the order defined in _WriteExtraToc, and should also
+ * use the Archiver input routines.
+ */
+static void
+_ReadExtraToc(ArchiveHandle *AH, TocEntry *te)
+{
+	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
+
+	if (tctx == NULL)
+	{
+		tctx = (lclTocEntry *) calloc(1, sizeof(lclTocEntry));
+		if (!tctx)
+			die_horribly(AH, modulename, "out of memory\n");
+		te->formatData = (void *) tctx;
+	}
+
+	tctx->filename = ReadStr(AH);
+	if (strlen(tctx->filename) == 0)
+	{
+		free(tctx->filename);
+		tctx->filename = NULL;
+	}
+}
+
+/*
+ * Called by the Archiver when restoring an archive to output a comment
+ * that includes useful information about the TOC entry.
+ */
+static void
+_PrintExtraToc(ArchiveHandle *AH, TocEntry *te)
+{
+	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
+
+	if (AH->public.verbose && tctx->filename)
+		ahprintf(AH, "-- File: %s\n", tctx->filename);
+}
+
+/*
+ * Called by the archiver when saving TABLE DATA (not schema). This routine
+ * should save whatever format-specific information is needed to read
+ * the archive back.
+ *
+ * It is called just prior to the dumper's 'DataDumper' routine being called.
+ *
+ * We create the data file for writing.
+ */
+static void
+_StartData(ArchiveHandle *AH, TocEntry *te)
+{
+	lclTocEntry	   *tctx = (lclTocEntry *) te->formatData;
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	char		   *fname;
+
+	fname = prependDirectory(AH, tctx->filename);
+
+	ctx->dataFH = cfopen_write(fname, PG_BINARY_W, AH->compression);
+	if (ctx->dataFH == NULL)
+		die_horribly(AH, modulename, "could not open output file \"%s\": %s\n",
+					 fname, strerror(errno));
+}
+
+/*
+ * Called by archiver when dumper calls WriteData. This routine is
+ * called for both BLOB and TABLE data; it is the responsibility of
+ * the format to manage each kind of data using StartBlob/StartData.
+ *
+ * It should only be called from within a DataDumper routine.
+ *
+ * We write the data to the open data file.
+ */
+static size_t
+_WriteData(ArchiveHandle *AH, const void *data, size_t dLen)
+{
+	lclContext		   *ctx = (lclContext *) AH->formatData;
+
+	if (dLen == 0)
+		return 0;
+
+	return cfwrite(data, dLen, ctx->dataFH);
+}
+
+/*
+ * Called by the archiver when a dumper's 'DataDumper' routine has
+ * finished.
+ *
+ * We close the data file.
+ */
+static void
+_EndData(ArchiveHandle *AH, TocEntry *te)
+{
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+
+	/* Close the file */
+	cfclose(ctx->dataFH);
+
+	ctx->dataFH = NULL;
+}
+
+/*
+ * Print data for a given file (can be a BLOB as well)
+ */
+static void
+_PrintFileData(ArchiveHandle *AH, char *filename, RestoreOptions *ropt)
+{
+	size_t		cnt;
+	char	   *buf;
+	size_t		buflen;
+	cfp		   *cfp;
+
+	if (!filename)
+		return;
+
+	cfp  = cfopen_read(filename, PG_BINARY_R);
+	if (!cfp)
+		die_horribly(AH, modulename, "could not open input file \"%s\": %s\n",
+					 filename, strerror(errno));
+
+	buf = malloc(ZLIB_OUT_SIZE);
+	if (buf == NULL)
+		die_horribly(NULL, modulename, "out of memory\n");
+	buflen = ZLIB_OUT_SIZE;
+
+	while ((cnt = cfread(buf, buflen, cfp)))
+		ahwrite(buf, 1, cnt, AH);
+
+	free(buf);
+}
+
+/*
+ * Print data for a given TOC entry
+*/
+static void
+_PrintTocData(ArchiveHandle *AH, TocEntry *te, RestoreOptions *ropt)
+{
+	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
+
+	if (!tctx->filename)
+		return;
+
+	if (strcmp(te->desc, "BLOBS") == 0)
+		_LoadBlobs(AH, ropt);
+	else
+	{
+		char   *fname = prependDirectory(AH, tctx->filename);
+		_PrintFileData(AH, fname, ropt);
+	}
+}
+
+static void
+_LoadBlobs(ArchiveHandle *AH, RestoreOptions *ropt)
+{
+	Oid				oid;
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	char		   *fname;
+	char			line[MAXPGPATH];
+
+	StartRestoreBlobs(AH);
+
+	fname = prependDirectory(AH, "blobs.toc");
+
+	ctx->blobsTocFH = cfopen_read(fname, PG_BINARY_R);
+
+	if (ctx->blobsTocFH == NULL)
+		die_horribly(AH, modulename, "could not open large object TOC file \"%s\" for input: %s\n",
+					 fname, strerror(errno));
+
+	/* we cannot test for feof() since EOF only shows up in the low
+	 * level read functions. But they would die_horribly() anyway. */
+	while ((cfgets(ctx->blobsTocFH, line, MAXPGPATH)) != NULL)
+	{
+		char		fname[MAXPGPATH];
+		char		path[MAXPGPATH];
+
+		if (sscanf(line, "%u %s\n", &oid, fname) != 2)
+			die_horribly(AH, modulename, "invalid line in large object TOC file: %s\n",
+						 line);
+
+		StartRestoreBlob(AH, oid, ropt->dropSchema);
+		snprintf(path, MAXPGPATH, "%s/%s", ctx->directory, fname);
+		_PrintFileData(AH, path, ropt);
+		EndRestoreBlob(AH, oid);
+	}
+	if (!cfeof(ctx->blobsTocFH))
+		die_horribly(AH, modulename, "error reading large object TOC file \"%s\"\n",
+					 fname);
+
+	if (cfclose(ctx->blobsTocFH) != 0)
+		die_horribly(AH, modulename, "could not close large object TOC file \"%s\": %s\n",
+					 fname, strerror(errno));
+
+	ctx->blobsTocFH = NULL;
+
+	EndRestoreBlobs(AH);
+}
+
+
+/*
+ * Write a byte of data to the archive.
+ * Called by the archiver to do integer & byte output to the archive.
+ * These routines are only used to read & write the headers & TOC.
+ */
+static int
+_WriteByte(ArchiveHandle *AH, const int i)
+{
+	unsigned char c = (unsigned char) i;
+	lclContext *ctx = (lclContext *) AH->formatData;
+
+	if (cfwrite(&c, 1, ctx->dataFH) != 1)
+		die_horribly(AH, modulename, "could not write byte\n");
+
+	return 1;
+}
+
+/*
+ * Read a byte of data from the archive.
+ * Called by the archiver to read bytes & integers from the archive.
+ * These routines are only used to read & write headers & TOC.
+ * EOF should be treated as a fatal error.
+ */
+static int
+_ReadByte(ArchiveHandle *AH)
+{
+	lclContext *ctx = (lclContext *) AH->formatData;
+	int			res;
+
+	res = cfgetc(ctx->dataFH);
+	if (res == EOF)
+		die_horribly(AH, modulename, "unexpected end of file\n");
+
+	return res;
+}
+
+/*
+ * Write a buffer of data to the archive.
+ * Called by the archiver to write a block of bytes to the TOC or a data file.
+ */
+static size_t
+_WriteBuf(ArchiveHandle *AH, const void *buf, size_t len)
+{
+	lclContext *ctx = (lclContext *) AH->formatData;
+	size_t		res;
+
+	res = cfwrite(buf, len, ctx->dataFH);
+	if (res != len)
+		die_horribly(AH, modulename, "could not write to output file: %s\n",
+					 strerror(errno));
+
+	return res;
+}
+
+/*
+ * Read a block of bytes from the archive.
+ *
+ * Mandatory.
+ *
+ * Called by the archiver to read a block of bytes from the archive
+ *
+ */
+static size_t
+_ReadBuf(ArchiveHandle *AH, void *buf, size_t len)
+{
+	lclContext *ctx = (lclContext *) AH->formatData;
+	size_t		res;
+
+	res = cfread(buf, len, ctx->dataFH);
+
+	return res;
+}
+
+/*
+ * Close the archive.
+ *
+ * Mandatory.
+ *
+ * When writing the archive, this is the routine that actually starts
+ * the process of saving it to files. No data should be written prior
+ * to this point, since the user could sort the TOC after creating it.
+ *
+ * If an archive is to be written, this routine must call:
+ *		WriteHead			to save the archive header
+ *		WriteToc			to save the TOC entries
+ *		WriteDataChunks		to save all DATA & BLOBs.
+ */
+static void
+_CloseArchive(ArchiveHandle *AH)
+{
+	lclContext *ctx = (lclContext *) AH->formatData;
+	if (AH->mode == archModeWrite)
+	{
+		cfp	   *tocFH;
+		char   *fname = prependDirectory(AH, "toc.dat");
+
+		/* The TOC is always created uncompressed */
+		tocFH = cfopen_write(fname, PG_BINARY_W, 0);
+		if (tocFH == NULL)
+			die_horribly(AH, modulename, "could not open output file \"%s\": %s\n",
+						 fname, strerror(errno));
+		ctx->dataFH = tocFH;
+		WriteHead(AH);
+		WriteToc(AH);
+		if (cfclose(tocFH) != 0)
+			die_horribly(AH, modulename, "could not close TOC file: %s\n",
+						 strerror(errno));
+		WriteDataChunks(AH);
+	}
+	AH->FH = NULL;
+}
+
+
+/*
+ * BLOB support
+ */
+
+/*
+ * Called by the archiver when starting to save all BLOB DATA (not schema).
+ * It is called just prior to the dumper's DataDumper routine.
+ *
+ * We open the large object TOC file here, so that we can append a line to 
+ * it for each blob.
+ */
+static void
+_StartBlobs(ArchiveHandle *AH, TocEntry *te)
+{
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	char		   *fname;
+
+	fname = prependDirectory(AH, "blobs.toc");
+
+	/* The blob TOC file is never compressed */
+	ctx->blobsTocFH = cfopen_write(fname, "ab", 0);
+	if (ctx->blobsTocFH == NULL)
+		die_horribly(AH, modulename, "could not open output file \"%s\": %s\n",
+					 fname, strerror(errno));
+}
+
+/*
+ * Called by the archiver when we're about to start dumping a blob.
+ *
+ * We create a file to write the blob to.
+ */
+static void
+_StartBlob(ArchiveHandle *AH, TocEntry *te, Oid oid)
+{
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	char			fname[MAXPGPATH];
+
+	snprintf(fname, MAXPGPATH, "%s/blob_%u.dat", ctx->directory, oid);
+
+	ctx->dataFH = cfopen_write(fname, PG_BINARY_W, AH->compression);
+
+	if (ctx->dataFH == NULL)
+		die_horribly(AH, modulename, "could not open output file \"%s\": %s\n",
+					 fname, strerror(errno));
+}
+
+/*
+ * Called by the archiver when the dumper is finished writing a blob.
+ *
+ * We close the blob file and write an entry to the blob TOC file for it.
+ */
+static void
+_EndBlob(ArchiveHandle *AH, TocEntry *te, Oid oid)
+{
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	char			buf[50];
+	int				len;
+
+	/* Close the BLOB data file itself */
+	cfclose(ctx->dataFH);
+	ctx->dataFH = NULL;
+
+	/* register the blob in blobs.toc */
+	len = snprintf(buf, sizeof(buf), "%u blob_%u.dat\n", oid, oid);
+	if (cfwrite(buf, len, ctx->blobsTocFH) != len)
+		die_horribly(AH, modulename, "could not write to blobs TOC file\n");		
+}
+
+/*
+ * Called by the archiver when finishing saving all BLOB DATA.
+ *
+ * We close the blobs TOC file.
+ */
+static void
+_EndBlobs(ArchiveHandle *AH, TocEntry *te)
+{
+	lclContext *ctx = (lclContext *) AH->formatData;
+
+	cfclose(ctx->blobsTocFH);
+	ctx->blobsTocFH = NULL;
+}
+
+static void
+createDirectory(const char *dir)
+{
+	struct stat		st;
+
+	/* the directory must not exist yet. */
+	if (stat(dir, &st) == 0)
+	{
+		if (S_ISDIR(st.st_mode))
+			die_horribly(NULL, modulename,
+						 "cannot create directory %s, it exists already\n",
+						 dir);
+		else
+			die_horribly(NULL, modulename,
+						 "cannot create directory %s, a file with this name "
+						 "exists already\n", dir);
+	}
+
+	/*
+	 * Now we create the directory. Note that for some race condition we could
+	 * also run into the situation that the directory has been created just
+	 * between our two calls.
+	 */
+	if (mkdir(dir, 0700) < 0)
+		die_horribly(NULL, modulename, "could not create directory %s: %s",
+					 dir, strerror(errno));
+}
+
+
+static char *
+prependDirectory(ArchiveHandle *AH, const char *relativeFilename)
+{
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	static char		buf[MAXPGPATH];
+	char		   *dname;
+
+	dname = ctx->directory;
+
+	if (strlen(dname) + 1 + strlen(relativeFilename) + 1 > MAXPGPATH)
+			die_horribly(AH, modulename, "path name too long: %s", dname);
+
+	strcpy(buf, dname);
+	strcat(buf, "/");
+	strcat(buf, relativeFilename);
+
+	return buf;
+}
diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index 006f7da..e12552a 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -4,6 +4,10 @@
  *
  *	This file is copied from the 'files' format file, but dumps data into
  *	one temp file then sends it to the output TAR archive.
+ * 
+ *  NOTE: If you untar the created 'tar' file, the resulting files are
+ *  compatible with the 'directory' format. Please keep the two formats in
+ *  sync.
  *
  *	See the headers to pg_backup_files & pg_restore for more details.
  *
@@ -167,7 +171,7 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 		die_horribly(AH, modulename, "out of memory\n");
 
 	/*
-	 * Now open the TOC file
+	 * Now open the tar file, and load the TOC if we're in read mode.
 	 */
 	if (AH->mode == archModeWrite)
 	{
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 40b414b..bcac622 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -138,6 +138,7 @@ static int	no_unlogged_table_data = 0;
 
 
 static void help(const char *progname);
+static ArchiveFormat parseArchiveFormat(const char *format, ArchiveMode *mode);
 static void expand_schema_name_patterns(SimpleStringList *patterns,
 							SimpleOidList *oids);
 static void expand_table_name_patterns(SimpleStringList *patterns,
@@ -267,6 +268,8 @@ main(int argc, char **argv)
 	int			my_version;
 	int			optindex;
 	RestoreOptions *ropt;
+	ArchiveFormat archiveFormat = archUnknown;
+	ArchiveMode	archiveMode;
 
 	static int	disable_triggers = 0;
 	static int	outputNoTablespaces = 0;
@@ -539,36 +542,31 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
-	/* open the output file */
-	if (pg_strcasecmp(format, "a") == 0 || pg_strcasecmp(format, "append") == 0)
-	{
-		/* This is used by pg_dumpall, and is not documented */
-		plainText = 1;
-		g_fout = CreateArchive(filename, archNull, 0, archModeAppend);
-	}
-	else if (pg_strcasecmp(format, "c") == 0 || pg_strcasecmp(format, "custom") == 0)
-		g_fout = CreateArchive(filename, archCustom, compressLevel, archModeWrite);
-	else if (pg_strcasecmp(format, "f") == 0 || pg_strcasecmp(format, "file") == 0)
-	{
-		/*
-		 * Dump files into the current directory; for demonstration only, not
-		 * documented.
-		 */
-		g_fout = CreateArchive(filename, archFiles, compressLevel, archModeWrite);
-	}
-	else if (pg_strcasecmp(format, "p") == 0 || pg_strcasecmp(format, "plain") == 0)
-	{
+	archiveFormat = parseArchiveFormat(format, &archiveMode);
+
+	/* archiveFormat specific setup */
+	if (archiveFormat == archNull)
 		plainText = 1;
-		g_fout = CreateArchive(filename, archNull, 0, archModeWrite);
-	}
-	else if (pg_strcasecmp(format, "t") == 0 || pg_strcasecmp(format, "tar") == 0)
-		g_fout = CreateArchive(filename, archTar, compressLevel, archModeWrite);
-	else
+
+	/*
+	 * Ignore compression level for plain format. XXX: This is a bit
+	 * inconsistent, tar-format throws an error instead.
+	 */
+	if (archiveFormat == archNull)
+		compressLevel = 0;
+
+	/* Custom and directory formats are compressed by default */
+	if (compressLevel == -1)
 	{
-		write_msg(NULL, "invalid output format \"%s\" specified\n", format);
-		exit(1);
+		if (archiveFormat == archCustom || archiveFormat == archDirectory)
+			compressLevel = Z_DEFAULT_COMPRESSION;
+		else
+			compressLevel = 0;
 	}
 
+	/* open the output file */
+	g_fout = CreateArchive(filename, archiveFormat, compressLevel, archiveMode);
+
 	if (g_fout == NULL)
 	{
 		write_msg(NULL, "could not open output file \"%s\" for writing\n", filename);
@@ -835,8 +833,8 @@ help(const char *progname)
 	printf(_("  %s [OPTION]... [DBNAME]\n"), progname);
 
 	printf(_("\nGeneral options:\n"));
-	printf(_("  -f, --file=FILENAME         output file name\n"));
-	printf(_("  -F, --format=c|t|p          output file format (custom, tar, plain text)\n"));
+	printf(_("  -f, --file=FILENAME         output file or directory name\n"));
+	printf(_("  -F, --format=c|d|t|p        output file format (custom, directory, tar, plain text)\n"));
 	printf(_("  -v, --verbose               verbose mode\n"));
 	printf(_("  -Z, --compress=0-9          compression level for compressed formats\n"));
 	printf(_("  --lock-wait-timeout=TIMEOUT fail after waiting TIMEOUT for a table lock\n"));
@@ -894,6 +892,49 @@ exit_nicely(void)
 	exit(1);
 }
 
+static ArchiveFormat
+parseArchiveFormat(const char *format, ArchiveMode *mode)
+{
+	ArchiveFormat archiveFormat;
+
+	*mode = archModeWrite;
+
+	if (pg_strcasecmp(format, "a") == 0 || pg_strcasecmp(format, "append") == 0)
+	{
+		/* This is used by pg_dumpall, and is not documented */
+		archiveFormat = archNull;
+		*mode = archModeAppend;
+	}
+	else if (pg_strcasecmp(format, "c") == 0)
+		archiveFormat = archCustom;
+	else if (pg_strcasecmp(format, "custom") == 0)
+		archiveFormat = archCustom;
+	else if (pg_strcasecmp(format, "d") == 0)
+		archiveFormat = archDirectory;
+	else if (pg_strcasecmp(format, "directory") == 0)
+		archiveFormat = archDirectory;
+	else if (pg_strcasecmp(format, "f") == 0 || pg_strcasecmp(format, "file") == 0)
+		/*
+		 * Dump files into the current directory; for demonstration only, not
+		 * documented.
+		 */
+		archiveFormat = archFiles;
+	else if (pg_strcasecmp(format, "p") == 0)
+		archiveFormat = archNull;
+	else if (pg_strcasecmp(format, "plain") == 0)
+		archiveFormat = archNull;
+	else if (pg_strcasecmp(format, "t") == 0)
+		archiveFormat = archTar;
+	else if (pg_strcasecmp(format, "tar") == 0)
+		archiveFormat = archTar;
+	else
+	{
+		write_msg(NULL, "invalid output format \"%s\" specified\n", format);
+		exit(1);
+	}
+	return archiveFormat;
+}
+
 /*
  * Find the OIDs of all schemas matching the given list of patterns,
  * and append them to the given OID list.
@@ -2187,7 +2228,9 @@ dumpBlobs(Archive *AH, void *arg)
 					exit_nicely();
 				}
 
-				WriteData(AH, buf, cnt);
+				/* we try to avoid writing empty chunks */
+				if (cnt > 0)
+					WriteData(AH, buf, cnt);
 			} while (cnt > 0);
 
 			lo_close(g_conn, loFd);
diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index 1ddba72..37793ad 100644
--- a/src/bin/pg_dump/pg_restore.c
+++ b/src/bin/pg_dump/pg_restore.c
@@ -352,6 +352,11 @@ main(int argc, char **argv)
 				opts->format = archCustom;
 				break;
 
+			case 'd':
+			case 'D':
+				opts->format = archDirectory;
+				break;
+
 			case 'f':
 			case 'F':
 				opts->format = archFiles;
@@ -363,7 +368,7 @@ main(int argc, char **argv)
 				break;
 
 			default:
-				write_msg(NULL, "unrecognized archive format \"%s\"; please specify \"c\" or \"t\"\n",
+				write_msg(NULL, "unrecognized archive format \"%s\"; please specify \"c\", \"d\" or \"t\"\n",
 						  opts->formatName);
 				exit(1);
 		}
@@ -418,7 +423,7 @@ usage(const char *progname)
 	printf(_("\nGeneral options:\n"));
 	printf(_("  -d, --dbname=NAME        connect to database name\n"));
 	printf(_("  -f, --file=FILENAME      output file name\n"));
-	printf(_("  -F, --format=c|t         backup file format (should be automatic)\n"));
+	printf(_("  -F, --format=c|d|t       backup file format (should be automatic)\n"));
 	printf(_("  -l, --list               print summarized TOC of the archive\n"));
 	printf(_("  -v, --verbose            verbose mode\n"));
 	printf(_("  --help                   show this help, then exit\n"));
#11Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: pg_dump directory archive format / parallel pg_dump

On Fri, Jan 21, 2011 at 4:41 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

There's one UI thing that bothers me. The option to specify the target
directory is called --file. But it's clearly not a file. OTOH, I'd hate to
introduce a parallel --dir option just for this. Any thoughts on this?

If we were starting over, I'd probably suggest calling the option -o,
--output. But since -o is already taken (for --oids) I'd be inclined
to just make the help text read:

-f, --file=FILENAME output file (or directory) name
-F, --format=c|t|p|d output file format (custom, tar, text, dir)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#11)
Re: pg_dump directory archive format / parallel pg_dump

On 21.01.2011 15:35, Robert Haas wrote:

On Fri, Jan 21, 2011 at 4:41 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

There's one UI thing that bothers me. The option to specify the target
directory is called --file. But it's clearly not a file. OTOH, I'd hate to
introduce a parallel --dir option just for this. Any thoughts on this?

If we were starting over, I'd probably suggest calling the option -o,
--output. But since -o is already taken (for --oids) I'd be inclined
to just make the help text read:

-f, --file=FILENAME output file (or directory) name
-F, --format=c|t|p|d output file format (custom, tar, text, dir)

Ok, that's exactly what the patch does now. I guess it's fine then.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Heikki Linnakangas (#12)
Re: pg_dump directory archive format / parallel pg_dump

On 01/21/2011 10:34 AM, Heikki Linnakangas wrote:

On 21.01.2011 15:35, Robert Haas wrote:

On Fri, Jan 21, 2011 at 4:41 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

There's one UI thing that bothers me. The option to specify the target
directory is called --file. But it's clearly not a file. OTOH, I'd
hate to
introduce a parallel --dir option just for this. Any thoughts on this?

If we were starting over, I'd probably suggest calling the option -o,
--output. But since -o is already taken (for --oids) I'd be inclined
to just make the help text read:

-f, --file=FILENAME output file (or directory) name
-F, --format=c|t|p|d output file format (custom, tar, text,
dir)

Ok, that's exactly what the patch does now. I guess it's fine then.

Maybe we could change the hint to say "--file=DESTINATION" or
"--file=FILENAME|DIRNAME" ?

Just a thought.

cheers

andrew

#14Euler Taveira de Oliveira
In reply to: Andrew Dunstan (#13)
Re: pg_dump directory archive format / parallel pg_dump

On 21-01-2011 12:47, Andrew Dunstan wrote:

Maybe we could change the hint to say "--file=DESTINATION" or
"--file=FILENAME|DIRNAME" ?

... "--file=OUTPUT" or "--file=OUTPUTNAME".

--
Euler Taveira de Oliveira
http://www.timbira.com/

#15Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Euler Taveira de Oliveira (#14)
Re: pg_dump directory archive format / parallel pg_dump

On 21.01.2011 19:11, Euler Taveira de Oliveira wrote:

On 21-01-2011 12:47, Andrew Dunstan wrote:

Maybe we could change the hint to say "--file=DESTINATION" or
"--file=FILENAME|DIRNAME" ?

... "--file=OUTPUT" or "--file=OUTPUTNAME".

Ok, works for me.

I've committed this patch now, with a whole bunch of further fixes.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16Robert Haas
robertmhaas@gmail.com
In reply to: Joachim Wieland (#3)
Re: pg_dump directory archive format / parallel pg_dump

On Wed, Jan 19, 2011 at 12:45 AM, Joachim Wieland <joe@mcknight.de> wrote:

On Mon, Jan 17, 2011 at 5:38 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

This one is the last version of this patch? if so, commitfest app
should be updated to reflect that

Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.

The parallel pg_dump portion of this patch (i.e. the still-uncommitted
part) no longer applies. Please rebase.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Joachim Wieland
joe@mcknight.de
In reply to: Robert Haas (#16)
1 attachment(s)
Re: pg_dump directory archive format / parallel pg_dump

On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The parallel pg_dump portion of this patch (i.e. the still-uncommitted
part) no longer applies.  Please rebase.

Here is a rebased version with some minor changes as well. I haven't
tested it on Windows now but will do so as soon as the Unix part has
been reviewed.

Joachim

Attachments:

parallel_pg_dump.patch.gzapplication/x-gzip; name=parallel_pg_dump.patch.gzDownload
#18Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Joachim Wieland (#17)
Re: pg_dump directory archive format / parallel pg_dump

On Wed, Feb 2, 2011 at 13:32, Joachim Wieland <joe@mcknight.de> wrote:

Here is a rebased version with some minor changes as well.

I read the patch works as below. Am I understanding correctly?
1. Open all connections in a parent process.
2. Start transactions for each connection in the parent.
3. Spawn child processes with fork().
4. Each child process uses one of the inherited connections.
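
In code form, roughly this (a simplified sketch of my reading above, not the
actual patch code; the connection string, the query and all error handling
are placeholders only):

#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include "libpq-fe.h"

#define NUM_WORKERS 3

int
main(void)
{
    PGconn *conns[NUM_WORKERS];
    int     i;

    /* (1) open all connections and (2) start a transaction on each */
    for (i = 0; i < NUM_WORKERS; i++)
    {
        conns[i] = PQconnectdb("dbname=test");  /* placeholder conninfo */
        PQclear(PQexec(conns[i],
                       "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE"));
    }

    /* (3) spawn one child per connection */
    for (i = 0; i < NUM_WORKERS; i++)
    {
        if (fork() == 0)
        {
            /*
             * (4) the child keeps using the connection it inherited; note
             * that it also inherits all the other connections, which is
             * part of the concern below.
             */
            PQclear(PQexec(conns[i], "SELECT 1"));  /* stands in for the dump work */
            exit(0);
        }
    }

    while (wait(NULL) > 0)
        ;
    return 0;
}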

I think we have 2 important technical issues here:
 * The consistency is not perfect. Each transaction is started
  with small delays in steps 1-2, but we cannot guarantee that no
  other transaction happens between them.
 * Can we inherit connections to child processes with fork()?
  Moreover, we also need to pass running transactions to children.
  I wonder whether libpq is designed for such usage.

To solve both issues, we might want a way to control visibility
in the database server instead of in the client programs. Don't we need
server-side support like [1] before developing parallel dump?
[1]: http://wiki.postgresql.org/wiki/ClusterFeatures#Export_snapshots_to_other_sessions

I haven't
tested it on Windows now but will do so as soon as the Unix part has
been reviewed.

It might be better to remove the Windows-specific code from the first try.
I doubt the Windows message queue is the best API for such a console-based
application. I hope we can use the same implementation on all
platforms for inter-process/thread communication.

--
Itagaki Takahiro

#19Joachim Wieland
joe@mcknight.de
In reply to: Itagaki Takahiro (#18)
Re: pg_dump directory archive format / parallel pg_dump

On Thu, Feb 3, 2011 at 11:46 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

I think we have 2 important technical issues here:
 * The consistency is not perfect. Each transaction is started
  with small delays in steps 1-2, but we cannot guarantee that no
  other transaction happens between them.

This is exactly where the patch for synchronized snapshot comes into
the game. See https://commitfest.postgresql.org/action/patch_view?id=480
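
With such a patch in place the dance would be roughly the following sketch
(pg_export_snapshot() and SET TRANSACTION SNAPSHOT are only placeholder
names for whatever interface that patch ends up providing; error checking
omitted):

#include <stdio.h>
#include "libpq-fe.h"

/* have a worker connection adopt the snapshot of the master connection */
static void
sync_worker_snapshot(PGconn *master, PGconn *worker)
{
    PGresult   *res;
    char        command[128];

    /* master: open its transaction and export the snapshot identifier */
    PQclear(PQexec(master, "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE"));
    res = PQexec(master, "SELECT pg_export_snapshot()");

    /* worker: open its own transaction and adopt that snapshot */
    PQclear(PQexec(worker, "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE"));
    snprintf(command, sizeof(command),
             "SET TRANSACTION SNAPSHOT '%s'", PQgetvalue(res, 0, 0));
    PQclear(PQexec(worker, command));
    PQclear(res);
}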

 * Can we inherit connections to child processes with fork()?
  Moreover, we also need to pass running transactions to children.
  I wonder whether libpq is designed for such usage.

As far as I know you can inherit sockets to a child process, as long
as you make sure that after the fork only one of the two, parent or
child, uses the socket; the other one should close it. But this
wouldn't be an issue with the above-mentioned patch anyway.

It might be better to remove the Windows-specific code from the first try.
I doubt the Windows message queue is the best API for such a console-based
application. I hope we can use the same implementation on all
platforms for inter-process/thread communication.

Windows doesn't support pipes, but offers message queues to
exchange messages. Parallel pg_dump only exchanges messages in the
form of "DUMP 39209" or "RESTORE OK 48 23 93"; it doesn't exchange any
large chunks of binary data, just these small textual messages. The
messages also stay within the same process; they are just sent between
the different threads. The Windows part worked just fine when I tested
it last time. Do you have any other technology in mind that you think
is better suited?
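
Just to make the scale concrete, building and parsing such a message is no
more than this (an illustrative sketch, not the actual code; the meaning of
the three numbers is only an example):

#include <stdio.h>

int
main(void)
{
    char    msg[64];
    int     dumpId,
            n_errors,
            n_warnings;

    /* master -> worker: please dump the object with this dump ID */
    snprintf(msg, sizeof(msg), "DUMP %d", 39209);

    /* worker -> master: report a finished item */
    snprintf(msg, sizeof(msg), "RESTORE OK %d %d %d", 48, 23, 93);

    /* master side: parse the status message back out */
    if (sscanf(msg, "RESTORE OK %d %d %d", &dumpId, &n_errors, &n_warnings) == 3)
        printf("item %d done (%d, %d)\n", dumpId, n_errors, n_warnings);

    return 0;
}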

Joachim

#20Magnus Hagander
magnus@hagander.net
In reply to: Joachim Wieland (#19)
Re: pg_dump directory archive format / parallel pg_dump

On Sat, Feb 5, 2011 at 04:50, Joachim Wieland <joe@mcknight.de> wrote:

On Thu, Feb 3, 2011 at 11:46 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

It might be better to remove the Windows-specific code from the first try.
I doubt the Windows message queue is the best API for such a console-based
application. I hope we can use the same implementation on all
platforms for inter-process/thread communication.

Windows doesn't support pipes, but offers message queues to
exchange messages. Parallel pg_dump only exchanges messages in the
form of "DUMP 39209" or "RESTORE OK 48 23 93"; it doesn't exchange any
large chunks of binary data, just these small textual messages. The
messages also stay within the same process; they are just sent between
the different threads. The Windows part worked just fine when I tested
it last time. Do you have any other technology in mind that you think
is better suited?

Haven't been following this thread in details or read the code.. But
our /port directory contains a pipe() implementation for Windows,
that's used for the syslogger at least. Look in the code for pgpipe().
If using that one works, then that should probably be used rather than
something completely custom.
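
From memory, port.h wraps that behind pgpipe()/piperead()/pipewrite() so
callers can stay platform-neutral, roughly as in the sketch below (untested;
in particular I haven't checked whether a frontend program like pg_dump can
link the Windows implementation as it stands, so take this as an assumption
and look at the syslogger code for the real usage):

#include "postgres_fe.h"    /* should bring in the port.h declarations */

static int
pipe_roundtrip(void)
{
    int     fds[2];
    char    buf[64];
    int     n;

    if (pgpipe(fds) < 0)    /* plain pipe() on Unix, socket emulation on Windows */
        return -1;

    pipewrite(fds[1], "DUMP 39209", 10);
    n = piperead(fds[0], buf, sizeof(buf) - 1);
    if (n > 0)
        buf[n] = '\0';

    /* closing is close() vs. closesocket() depending on the platform, omitted */
    return n;
}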

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#21Jaime Casanova
jaime@2ndquadrant.com
In reply to: Joachim Wieland (#17)
Re: pg_dump directory archive format / parallel pg_dump

On Tue, Feb 1, 2011 at 11:32 PM, Joachim Wieland <joe@mcknight.de> wrote:

On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The parallel pg_dump portion of this patch (i.e. the still-uncommitted
part) no longer applies.  Please rebase.

Here is a rebased version with some minor changes as well. I haven't
tested it on Windows now but will do so as soon as the Unix part has
been reviewed.

code review:

Something I found, and it's a very simple one, is this warning (there's
a similar issue in _StartMasterParallel with the buf variable):
"""
pg_backup_directory.c: In function ‘_EndMasterParallel’:
pg_backup_directory.c:856: warning: ‘status’ may be used uninitialized
in this function
"""

I guess the huge amount of info the patch is showing is just for
debugging and will be removed before commit, right?

functional review:

It works well most of the time, just a few points:
- if I interrupt the process the connections stay; I guess it could
catch the signal and close the connections
- if I have an exclusive lock on a table and a worker starts dumping
it, it fails because it can't take the lock, but it just says "it was
ok"; I would prefer an error

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training

#22Jaime Casanova
jaime@2ndquadrant.com
In reply to: Jaime Casanova (#21)
Re: pg_dump directory archive format / parallel pg_dump

On Sun, Feb 6, 2011 at 2:12 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Tue, Feb 1, 2011 at 11:32 PM, Joachim Wieland <joe@mcknight.de> wrote:

On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The parallel pg_dump portion of this patch (i.e. the still-uncommitted
part) no longer applies.  Please rebase.

Here is a rebased version with some minor changes as well. I haven't
tested it on Windows now but will do so as soon as the Unix part has
been reviewed.

code review:

Ah! Two other things I forgot:

- there are no docs
- pg_dump and pg_restore are inconsistent:
pg_dump requires the directory to be provided with the -f option:
pg_dump -Fd -f dir_dump
pg_restore takes the directory as an argument after -Fd: pg_restore -Fd dir_dump

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training

#23Joachim Wieland
joe@mcknight.de
In reply to: Jaime Casanova (#21)
Re: pg_dump directory archive format / parallel pg_dump

Hi Jaime,

thanks for your review!

On Sun, Feb 6, 2011 at 2:12 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

code review:

Something I found, and it's a very simple one, is this warning (there's
a similar issue in _StartMasterParallel with the buf variable):
"""
pg_backup_directory.c: In function ‘_EndMasterParallel’:
pg_backup_directory.c:856: warning: ‘status’ may be used uninitialized
in this function
"""

Cool. My compiler didn't tell me about this.

I guess the huge amount of info the patch is showing is just for
debugging and will be removed before commit, right?

That's right.

functional review:

It works well most of the time, just a few points:
- if I interrupt the process the connections stay; I guess it could
catch the signal and close the connections

Hm, well, recovering gracefully from errors could be improved. In
your example you would signal the children implicitly because the
parent process dies and the pipes to the children would get broken as
well. Of course the parent could more actively terminate the children
but it might not be the best option to just kill them, as then there
will be a lot of "unexpected EOF" connections in the log. So if an
error condition comes up in the parent (as in your example, because
you canceled the process), then ideally the parent should signal the
children with a non-lethal signal and the children should catch this
"please terminate" signal and exit cleanly but as soon as possible. If
the error case comes up at the child however, then we'd need to make
sure that the user sees the error message from the child. This should
work well as-is, but currently it could happen that the parent exits
before all of the children have exited. I'll investigate this a bit...
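
The usual pattern would be a flag that is set from the signal handler and
that the worker checks between steps, roughly like this (a sketch, not the
patch's code; the choice of SIGTERM is just an example):

#include <signal.h>
#include <stdlib.h>

static volatile sig_atomic_t wantAbort = 0;

/* the parent sends e.g. SIGTERM to ask a worker to wrap up */
static void
abort_handler(int signo)
{
    (void) signo;
    wantAbort = 1;
}

static void
worker_loop(void)
{
    signal(SIGTERM, abort_handler);

    for (;;)
    {
        if (wantAbort)
        {
            /* close the connection cleanly (PQfinish) and exit right away */
            exit(1);
        }
        /* ... dump the next object ... */
    }
}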

- if I have an exclusive lock on a table and a worker starts dumping
it, it fails because it can't take the lock, but it just says "it was
ok"; I would prefer an error

I'm getting a clear

pg_dump: [archiver] could not lock table public.c: ERROR: could
not obtain lock on relation "c"

but I'll look into this as well.
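
For reference, the check on the worker side boils down to something like the
sketch below (simplified; identifier quoting and error handling look
different in the real code). The point is just that a failed NOWAIT lock has
to be reported as an error rather than swallowed:

#include <stdio.h>
#include "libpq-fe.h"

/* try to lock one table without waiting and complain loudly on failure */
static int
lock_table_nowait(PGconn *conn, const char *qualified_name)
{
    char        query[256];
    PGresult   *res;
    int         ok;

    snprintf(query, sizeof(query),
             "LOCK TABLE %s IN ACCESS SHARE MODE NOWAIT", qualified_name);
    res = PQexec(conn, query);
    ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    if (!ok)
        fprintf(stderr, "could not lock table %s: %s",
                qualified_name, PQerrorMessage(conn));
    PQclear(res);
    return ok;
}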

Regarding your other post:

- there are no docs

True...

- pg_dump and pg_restore are inconsistent:
pg_dump requires the directory to be provided with the -f option:
pg_dump -Fd -f dir_dump
pg_restore takes the directory as an argument after -Fd: pg_restore -Fd dir_dump

Well, this is how pg_dump and pg_restore currently behave as well. -F
is the switch for the format and it just takes "d" as the format. The
dir_dump is an argument without any switch.

See the output for the --help switches:

Usage:
pg_dump [OPTION]... [DBNAME]

Usage:
pg_restore [OPTION]... [FILE]

So in either case you don't need to give a switch for what you have.
If you run pg_dump you don't give the switch for the database but you
need to give it for the output (-f) and with pg_restore you don't give
a switch for the file that you're restoring but you'd need to give -d
for restoring to a database.

Joachim

#24Robert Haas
robertmhaas@gmail.com
In reply to: Joachim Wieland (#23)
Re: pg_dump directory archive format / parallel pg_dump

On Mon, Feb 7, 2011 at 10:42 PM, Joachim Wieland <joe@mcknight.de> wrote:

i guess the huge amount of info is showing the patch is just for
debugging and will be removed before commit, right?

That's right.

So how close are we to having a committable version of this? Should
we push this out to 9.2?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#25Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Robert Haas (#24)
Re: pg_dump directory archive format / parallel pg_dump

On Tue, Feb 8, 2011 at 13:34, Robert Haas <robertmhaas@gmail.com> wrote:

So how close are we to having a committable version of this?  Should
we push this out to 9.2?

I think so. The feature is pretty attractive, but more work is required:
 * Re-base on the synchronized snapshots patch
 * Consider using pipes on Windows as well.
 * Research the libpq + fork() issue. We have a warning in the docs:
http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html
| On Unix, forking a process with open libpq connections can lead to
unpredictable results

--
Itagaki Takahiro

#26Joachim Wieland
joe@mcknight.de
In reply to: Itagaki Takahiro (#25)
Re: pg_dump directory archive format / parallel pg_dump

On Tue, Feb 8, 2011 at 8:31 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

On Tue, Feb 8, 2011 at 13:34, Robert Haas <robertmhaas@gmail.com> wrote:

So how close are we to having a committable version of this?  Should
we push this out to 9.2?

I think so. The feature is pretty attractive, but more work is required:
 * Re-base on the synchronized snapshots patch
 * Consider using pipes on Windows as well.
 * Research the libpq + fork() issue. We have a warning in the docs:
http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html
| On Unix, forking a process with open libpq connections can lead to
unpredictable results

Just for the record, once the sync snapshot patch is committed, there
is no need to do fancy libpq + fork() combinations anyway.
Unfortunately, so far no committer has commented on the synchronized
snapshot patch at all.

I am not fighting for getting parallel pg_dump done in 9.1, as I don't
really have a personal use case for the patch. However it would be the
irony of the year if we shipped 9.1 with a synchronized snapshot patch
but no parallel dump :-)

Joachim

#27Robert Haas
robertmhaas@gmail.com
In reply to: Joachim Wieland (#26)
Re: pg_dump directory archive format / parallel pg_dump

On Tue, Feb 8, 2011 at 10:54 PM, Joachim Wieland <joe@mcknight.de> wrote:

On Tue, Feb 8, 2011 at 8:31 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

On Tue, Feb 8, 2011 at 13:34, Robert Haas <robertmhaas@gmail.com> wrote:

So how close are we to having a committable version of this?  Should
we push this out to 9.2?

I think so. The feature is pretty attractive, but more work is required:
 * Re-base on the synchronized snapshots patch
 * Consider using pipes on Windows as well.
 * Research the libpq + fork() issue. We have a warning in the docs:
http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html
| On Unix, forking a process with open libpq connections can lead to
unpredictable results

Just for the record, once the sync snapshot patch is committed, there
is no need to do fancy libpq + fork() combinations anyway.
Unfortunately, so far no committer has commented on the synchronized
snapshot patch at all.

I am not fighting for getting parallel pg_dump done in 9.1, as I don't
really have a personal use case for the patch. However it would be the
irony of the year if we shipped 9.1 with a synchronized snapshot patch
but no parallel dump  :-)

True. But it looks like there are some outstanding items from
previous reviews that you've yet to address, which makes pushing it
out seem fairly reasonable...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company