pg_dump directory archive format / parallel pg_dump

Started by Joachim Wieland about 15 years ago (27 messages)
#1 Joachim Wieland
joe@mcknight.de
3 attachment(s)

Here's a new series of patches for the parallel dump/restore. They need to be
applied on top of each other.

The parallel pg_dump patch does not yet use the synchronized snapshot
functionality from my other patch, so as not to create more dependencies than
necessary.

(1) pg_dump directory archive format (without checks as requested by Heikki)
(2) parallel pg_dump
(3) checks for the directory archive format

Joachim

Attachments:

pg_dump-directory.diff.gz (application/x-gzip)
pg_dump-directory-parallel.diff.gz (application/x-gzip)
pg_dump-directory-parallel-checks.diff.gz (application/x-gzip)
#2 Jaime Casanova
jaime@2ndquadrant.com
In reply to: Joachim Wieland (#1)
Re: pg_dump directory archive format / parallel pg_dump

On Fri, Jan 7, 2011 at 3:18 PM, Joachim Wieland <joe@mcknight.de> wrote:

Here's a new series of patches for the parallel dump/restore. They need to be
applied on top of each other.

Is this the latest version of this patch? If so, the commitfest app
should be updated to reflect that

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training

#3 Joachim Wieland
joe@mcknight.de
In reply to: Jaime Casanova (#2)
3 attachment(s)
Re: pg_dump directory archive format / parallel pg_dump

On Mon, Jan 17, 2011 at 5:38 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

Is this the latest version of this patch? If so, the commitfest app
should be updated to reflect that

Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.

Joachim

Attachments:

pg_dump-directory.diff.gz (application/x-gzip)
pg_dump-directory-parallel.diff.gz (application/x-gzip)
pg_dump-directory-parallel-checks.diff.gz (application/x-gzip)
#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#3)
Re: pg_dump directory archive format / parallel pg_dump

On 19.01.2011 07:45, Joachim Wieland wrote:

On Mon, Jan 17, 2011 at 5:38 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

Is this the latest version of this patch? If so, the commitfest app
should be updated to reflect that

Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.

What's the idea of storing the file sizes in the toc file? It looks like
it's not used for anything.

It would be nice to have this format match the tar format. At the
moment, there's a couple of cosmetic differences:

* TOC file is called "TOC", instead of "toc.dat"

* blobs TOC file is called "BLOBS.TOC" instead of "blobs.toc"

* each blob is stored as "blobs/<oid>.dat", instead of "blob_<oid>.dat"

The only significant difference is that in the directory archive format,
each data file has a header in the beginning.

What are the benefits of the data file header? Would it be better to
leave it out, so that the format would be identical to the tar format?
You could then just tar up the directory to get a tar archive, or vice
versa.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5 Joachim Wieland
joe@mcknight.de
In reply to: Heikki Linnakangas (#4)
Re: pg_dump directory archive format / parallel pg_dump

On Wed, Jan 19, 2011 at 7:47 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.

What's the idea of storing the file sizes in the toc file? It looks like
it's not used for anything.

It's part of the overall idea to make sure files are not inadvertently
exchanged between different backups and that a file is not truncated.
In the future I'd also like to add a checksum to the TOC so that a
backup can be checked for integrity. This will cost performance but
with the parallel backup it can be distributed to several processors.

It would be nice to have this format match the tar format. At the moment,
there's a couple of cosmetic differences:

* TOC file is called "TOC", instead of "toc.dat"

* blobs TOC file is called "BLOBS.TOC" instead of "blobs.toc"

* each blob is stored as "blobs/<oid>.dat", instead of "blob_<oid>.dat"

That can be done easily...

The only significant difference is that in the directory archive format,
each data file has a header in the beginning.

What are the benefits of the data file header? Would it be better to leave
it out, so that the format would be identical to the tar format? You could
then just tar up the directory to get a tar archive, or vice versa.

The header is there to identify a file; it contains the same header that
every other pg_dump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.

If you want to drop the header altogether, fine with me but if it's
just for the tar <-> directory conversion, then I am failing to see
what the use case of that would be.

A tar archive has the advantage that you can postprocess the dump data
with other tools but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...

Joachim

#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#5)
Re: pg_dump directory archive format / parallel pg_dump

On 19.01.2011 16:01, Joachim Wieland wrote:

On Wed, Jan 19, 2011 at 7:47 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.

What's the idea of storing the file sizes in the toc file? It looks like
it's not used for anything.

It's part of the overall idea to make sure files are not inadvertently
exchanged between different backups and that a file is not truncated.
In the future I'd also like to add a checksum to the TOC so that a
backup can be checked for integrity. This will cost performance but
with the parallel backup it can be distributed to several processors.

Ok. I'm going to leave out the filesize. I can see some value in that,
and the CRC, but I don't want to add stuff that's not used at this point.

It would be nice to have this format match the tar format. At the moment,
there's a couple of cosmetic differences:

* TOC file is called "TOC", instead of "toc.dat"

* blobs TOC file is called "BLOBS.TOC" instead of "blobs.toc"

* each blob is stored as "blobs/<oid>.dat", instead of "blob_<oid>.dat"

That can be done easily...

The only significant difference is that in the directory archive format,
each data file has a header in the beginning.

What are the benefits of the data file header? Would it be better to leave
it out, so that the format would be identical to the tar format? You could
then just tar up the directory to get a tar archive, or vice versa.

The header is there to identify a file; it contains the same header that
every other pg_dump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.

Hmm, tar format doesn't support compression, but looks like the file
format issue has been thought of already: there's still code there to
add .gz suffix for compressed files. How about adopting that convention
in the directory format too? That would make an uncompressed directory
format compatible with the tar format.

That seems pretty attractive anyway, because you can then dump to a
directory, and manually gzip the data files later.

Now that we have an API for compression in compress_io.c, it probably
wouldn't be very hard to implement the missing compression support to
tar format either.

If you want to drop the header altogether, fine with me but if it's
just for the tar<-> directory conversion, then I am failing to see
what the use case of that would be.

A tar archive has the advantage that you can postprocess the dump data
with other tools but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...

I don't know how popular it'll be in practice, but it seems very nice to
me if you can do things like parallel pg_dump in directory format first,
and then tar it up to a file for archival.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7 Joachim Wieland
joe@mcknight.de
In reply to: Heikki Linnakangas (#6)
Re: pg_dump directory archive format / parallel pg_dump

On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

It's part of the overall idea to make sure files are not inadvertently
exchanged between different backups and that a file is not truncated.
In the future I'd also like to add a checksum to the TOC so that a
backup can be checked for integrity. This will cost performance but
with the parallel backup it can be distributed to several processors.

Ok. I'm going to leave out the filesize. I can see some value in that, and
the CRC, but I don't want to add stuff that's not used at this point.

Okay.

The header is there to identify a file; it contains the same header that
every other pg_dump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.

Hmm, tar format doesn't support compression, but looks like the file format
issue has been thought of already: there's still code there to add .gz
suffix for compressed files. How about adopting that convention in the
directory format too? That would make an uncompressed directory format
compatible with the tar format.

So what you could do is dump in the tar format, untar and restore in
the directory format. I see that this sounds nice but still I am not
sure why someone would dump to the tar format in the first place.

But you still cannot go back from the directory archive to the tar
archive because the standard command line tar will not respect the
order of the objects that pg_restore expects in a tar format, right?

That seems pretty attractive anyway, because you can then dump to a
directory, and manually gzip the data files later.

The command line gzip will probably add its own header to the file
that pg_restore would need to strip off...

This is a valid use case for people who are concerned with a fast
dump: usually they would dump uncompressed and compress the
archive later. However, once we have parallel pg_dump, this advantage
vanishes.

Now that we have an API for compression in compress_io.c, it probably
wouldn't be very hard to implement the missing compression support to tar
format either.

True, but the question about the advantage of the tar format remains :-)

A tar archive has the advantage that you can postprocess the dump data
with other tools  but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...

I don't know how popular it'll be in practice, but it seems very nice to me
if you can do things like parallel pg_dump in directory format first, and
then tar it up to a file for archival.

Yes, but you cannot pg_restore the archive then if it was created with
standard tar, right?

Joachim

#8 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Joachim Wieland (#7)
Re: pg_dump directory archive format / parallel pg_dump

On 20.01.2011 15:46, Joachim Wieland wrote:

On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

The header is there to identify a file; it contains the same header that
every other pg_dump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.

Hmm, tar format doesn't support compression, but looks like the file format
issue has been thought of already: there's still code there to add .gz
suffix for compressed files. How about adopting that convention in the
directory format too? That would make an uncompressed directory format
compatible with the tar format.

So what you could do is dump in the tar format, untar and restore in
the directory format. I see that this sounds nice but still I am not
sure why someone would dump to the tar format in the first place.

I'm not sure either. Maybe you want to pipe the output of "pg_dump -F t"
via an ssh tunnel to another host, where you untar it, producing a
directory format dump. You can then edit the directory format dump, and
restore it back to the database without having to tar it again.

It gives you a lot of flexibility if the formats are compatible, which
is generally good.
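
For illustration, that round trip could look roughly like this (a sketch only;
"mydb", "otherhost" and "dumpdir" are hypothetical names, and it assumes the
directory-format reader from this patch on the receiving side):

  pg_dump -Ft mydb | ssh otherhost 'mkdir dumpdir && tar -xf - -C dumpdir'
  # ... edit the files under dumpdir on otherhost ...
  pg_restore -d mydb dumpdir     # run on otherhost; reads the directory dump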

But you still cannot go back from the directory archive to the tar
archive because the standard command line tar will not respect the
order of the objects that pg_restore expects in a tar format, right?

Hmm, I didn't realize pg_restore requires the files to be in a certain
order in the tar file. There's no mention of that in the docs either; we
should add that. It doesn't actually require that if you read from a
file, but from stdin it does.

You can put files in the archive in a certain order if you list them
explicitly in the tar command line, like "tar cf backup.tar toc.dat
...". It's hard to know the right order, though. In practice you would
need to do "tar tf backup.tar >files" before untarring, and use "files"
to tar them again in the right order.
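
A rough sketch of that workflow ("backup.tar" and "files" are just example
names; it assumes GNU tar, whose -T/--files-from option adds members in the
listed order):

  tar tf backup.tar > files      # record the member order before extracting
  tar xf backup.tar              # unpack into the directory-format layout
  # ... edit the extracted files ...
  tar cf backup2.tar -T files    # re-create the archive in the same order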

That seems pretty attractive anyway, because you can then dump to a
directory, and manually gzip the data files later.

The command line gzip will probably add its own header to the file
that pg_restore would need to strip off...

Yeah, we should write the header too. That's not hard, e.g. gzopen will
do that automatically, or you can pass a flag to deflateInit2.

A tar archive has the advantage that you can postprocess the dump data
with other tools but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...

I don't know how popular it'll be in practice, but it seems very nice to me
if you can do things like parallel pg_dump in directory format first, and
then tar it up to a file for archival.

Yes, but you cannot pg_restore the archive then if it was created with
standard tar, right?

See above, you can unless you try to pipe it to pg_restore. In fact,
that's listed as an advantage of the tar format over other formats in
the pg_dump documentation.
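
For example, something along these lines should work (hypothetical names;
assumes an uncompressed directory dump, i.e. -Z0, so the files match the tar
format):

  pg_dump -Fd -Z0 -f dumpdir mydb        # uncompressed directory dump
  (cd dumpdir && tar cf ../mydb.tar *)   # archive it with standard tar
  pg_restore -d newdb mydb.tar           # fine: pg_restore can seek in a file
  pg_restore -d newdb < mydb.tar         # may fail: stdin needs toc.dat first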

(I'm working on this, no need to submit a new patch)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#9 Florian Pflug
fgp@phlo.org
In reply to: Heikki Linnakangas (#8)
Re: pg_dump directory archive format / parallel pg_dump

On Jan20, 2011, at 16:22 , Heikki Linnakangas wrote:

You can put files in the archive in a certain order if you list them explicitly in the tar command line, like "tar cf backup.tar toc.dat ...". It's hard to know the right order, though. In practice you would need to do "tar tf backup.tar >files" before untarring, and use "files" to tar them again in the right order.

Hm, could we create a file in the backup directory which lists the files in the right order?

best regards,
Florian Pflug

#10 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#8)
1 attachment(s)
Re: pg_dump directory archive format / parallel pg_dump

On 20.01.2011 17:22, Heikki Linnakangas wrote:

(I'm working on this, no need to submit a new patch)

Ok, here's a heavily refactored version of this (also available at
git://git.postgresql.org/git/users/heikki/postgres.git, branch
pg_dump_directory). The directory format is now identical to the tar
format, except that in the directory format the files can be compressed.
Also we don't write the restore.sql file - it would be nice to have, but
pg_restore doesn't require it. We can leave that as a TODO.

I ended up writing another compression abstraction layer in
compress_io.c. It wraps fopen / gzopen etc. in a common API, so that the
caller doesn't need to care if the file is compressed or not. In
hindsight, the compression API we put in earlier didn't suit us very
well. But I guess it wasn't a complete waste, as it moved the gory
details of zlib out of the custom format code.

If compression is used, the files are created with the .gz suffix, and
include the gzip header so that you can manipulate them easily with
gzip/gunzip utilities. When reading, we accept files with or without the
.gz suffix, and you can have some files compressed and others uncompressed.
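
So a workflow roughly like this should be possible (hypothetical names; -Z0
gives an uncompressed dump to start from):

  pg_dump -Fd -Z0 -f dumpdir mydb     # fast, uncompressed directory dump
  gzip dumpdir/[0-9]*.dat             # compress the per-table data files later
  pg_restore -d newdb dumpdir         # compressed and plain files can be mixed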

I haven't updated the documentation yet.

There's one UI thing that bothers me. The option to specify the target
directory is called --file. But it's clearly not a file. OTOH, I'd hate
to introduce a parallel --dir option just for this. Any thoughts on this?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

pg_dump_directory-2.patch (text/x-diff)
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index de4968c..5266cc8 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -194,8 +194,11 @@ PostgreSQL documentation
       <term><option>--file=<replaceable class="parameter">file</replaceable></option></term>
       <listitem>
        <para>
-        Send output to the specified file.  If this is omitted, the
-        standard output is used.
+        Send output to the specified file. This parameter can be omitted for file
+        based output formats, in which case the standard output is used. It must
+        be given for the directory output format however, where it specifies the target
+        directory instead of a file. In this case the directory is created by
+        <command>pg_dump</command> and must not exist before.
        </para>
       </listitem>
      </varlistentry>
@@ -226,9 +229,24 @@ PostgreSQL documentation
           <para>
            Output a custom-format archive suitable for input into
            <application>pg_restore</application>.
-           This is the most flexible output format in that it allows manual
-           selection and reordering of archived items during restore.
-           This format is also compressed by default.
+           Together with the directory output format, this is the most flexible
+           output format in that it allows manual selection and reordering of
+           archived items during restore. This format is also compressed by
+           default.
+          </para>
+         </listitem>
+        </varlistentry>
+
+        <varlistentry>
+         <term><literal>d</></term>
+         <term><literal>directory</></term>
+         <listitem>
+          <para>
+           Output a directory-format archive suitable for input into
+           <application>pg_restore</application>. This will create a directory
+           instead of a file and this directory will contain one file for each
+           table and BLOB of the database that is being dumped. This format is
+           compressed by default.
           </para>
          </listitem>
         </varlistentry>
@@ -947,6 +965,14 @@ CREATE DATABASE foo WITH TEMPLATE template0;
   </para>
 
   <para>
+   To dump a database into a directory-format archive:
+
+<screen>
+<prompt>$</prompt> <userinput>pg_dump -Fd mydb -f dumpdir</userinput>
+</screen>
+  </para>
+
+  <para>
    To reload an archive file into a (freshly created) database named
    <literal>newdb</>:
 
diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index db607b4..8410af1 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -20,7 +20,7 @@ override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
 
 OBJS=	pg_backup_archiver.o pg_backup_db.o pg_backup_custom.o \
 	pg_backup_files.o pg_backup_null.o pg_backup_tar.o \
-	dumputils.o compress_io.o $(WIN32RES)
+	pg_backup_directory.o dumputils.o compress_io.o $(WIN32RES)
 
 KEYWRDOBJS = keywords.o kwlookup.o
 
diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index 8c41a69..506533a 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -7,6 +7,17 @@
  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
+ * This file includes two APIs for dealing with compressed data. The first
+ * provides more flexibility, using callbacks to read/write data from the
+ * underlying stream. The second API is a wrapper around fopen/gzopen and
+ * friends, providing an interface similar to those, but abstracts away
+ * the compression (or not). Both APIs use libz for the compression, but
+ * the second API uses gzip headers, so the resulting files can be easily
+ * manipulated with the gzip utility.
+ *
+ * Compressor API
+ * --------------
+ *
  *  The interface for writing to an archive consists of three functions:
  *  AllocateCompressor, WriteDataToArchive and EndCompressor. First you call
  *  AllocateCompressor, then write all the data by calling WriteDataToArchive
@@ -23,6 +34,17 @@
  *
  *  The interface is the same for compressed and uncompressed streams.
  *
+ * Compressed stream API
+ * ----------------------
+ *
+ *  The compressed stream API is a wrapper around to the C standard fopen()
+ *  and libz's gzopen() API. It allows you to use the same functions for
+ *  compressed and uncompressed streams. cfopen_read() first tries to open
+ *  the file with given name, and if it fails, it tries to open the same
+ *  file with the .gz suffix. cfopen_write() opens a file for writing, an
+ *  extra argument specifies if the file should be compressed, and adds the
+ *  .gz suffix to the filename if it is. This allows you to easily handle both
+ *  compressed and uncompressed files.
  *
  * IDENTIFICATION
  *     src/bin/pg_dump/compress_io.c
@@ -33,6 +55,11 @@
 #include "compress_io.h"
 
 
+/*----------------------
+ * Compressor API
+ *----------------------
+ */
+
 /* typedef appears in compress_io.h */
 struct CompressorState
 {
@@ -48,6 +75,10 @@ struct CompressorState
 
 static const char *modulename = gettext_noop("compress_io");
 
+#ifdef HAVE_LIBZ
+static int	hasSuffix(const char *filename, const char *suffix);
+#endif
+
 static void ParseCompressionOption(int compression, CompressionAlgorithm *alg,
 								   int *level);
 
@@ -418,3 +449,245 @@ WriteDataToArchiveNone(ArchiveHandle *AH, CompressorState *cs,
 }
 
 
+/*----------------------
+ * Compressed stream API
+ *----------------------
+ */
+
+/*
+ * cfp represents an open stream, wrapping the underlying FILE or gzFile
+ * pointer. This is opaque to the callers.
+ */
+struct cfp
+{
+	FILE *uncompressedfp;
+#ifdef HAVE_LIBZ
+	gzFile compressedfp;
+#endif
+};
+
+/*
+ * Return the raw FILE pointer associated with a stream. The stream must be
+ * uncompressed.
+ */
+FILE *
+cfgetfp(cfp *fp)
+{
+	if (fp->uncompressedfp == NULL)
+	{
+		*((int *) (NULL)) = 1;
+		die_horribly(NULL, modulename, "cannot get plain FILE * from a compressed stream\n");
+	}
+	return fp->uncompressedfp;
+}
+
+/*
+ * Open a file for reading. 'path' is the file to open, and 'mode' should
+ * be either "r" or "rb".
+ *
+ * If the file at 'path' does not exist, we append the ".gz" suffix (if 'path'
+ * doesn't already have it) and try again. So if you pass "foo" as 'path',
+ * this will open either "foo" or "foo.gz".
+ */
+cfp *
+cfopen_read(const char *path, const char *mode)
+{
+	cfp *fp;
+
+#ifdef HAVE_LIBZ
+	if (hasSuffix(path, ".gz"))
+		fp = cfopen(path, mode, 1);
+	else
+#endif
+	{
+		fp = cfopen(path, mode, 0);
+#ifdef HAVE_LIBZ
+		if (fp == NULL)
+		{
+			int fnamelen = strlen(path) + 4;
+			char *fname = malloc(fnamelen);
+			if (fname == NULL)
+				die_horribly(NULL, modulename, "Out of memory\n");
+
+			snprintf(fname, fnamelen, "%s%s", path, ".gz");
+			fp = cfopen(fname, mode, 1);
+			free(fname);
+		}
+#endif
+	}
+	return fp;
+}
+
+/*
+ * Open a file for writing. 'path' indicates the path name, and 'mode' must
+ * be a filemode as accepted by fopen() and gzopen() that indicates writing
+ * ("w", "wb", "a", or "ab").
+ *
+ * If 'compression' is non-zero, a gzip compressed stream is opened, and
+ * and 'compression' indicates the compression level used. The ".gz" suffix
+ * is automatically added to 'path' in that case.
+ */
+cfp *
+cfopen_write(const char *path, const char *mode, int compression)
+{
+	cfp *fp;
+
+	if (compression == 0)
+		fp = cfopen(path, mode, 0);
+	else
+	{
+#ifdef HAVE_LIBZ
+		int fnamelen = strlen(path) + 4;
+		char *fname = malloc(fnamelen);
+		if (fname == NULL)
+			die_horribly(NULL, modulename, "Out of memory\n");
+
+		snprintf(fname, fnamelen, "%s%s", path, ".gz");
+		fp = cfopen(fname, mode, 1);
+		free(fname);
+#else
+		die_horribly(NULL, modulename, "not built with zlib support\n");
+#endif
+	}
+	return fp;
+}
+
+/*
+ * Opens file 'path' in 'mode'. If 'compression' is non-zero, the file
+ * is opened with libz gzopen(), otherwise with plain fopen()
+ */
+cfp *
+cfopen(const char *path, const char *mode, int compression)
+{
+	cfp *fp = malloc(sizeof(cfp));
+	if (fp == NULL)
+		die_horribly(NULL, modulename, "Out of memory\n");
+
+	if (compression != 0)
+	{
+#ifdef HAVE_LIBZ
+		fp->compressedfp = gzopen(path, mode);
+		fp->uncompressedfp = NULL;
+		if (fp->compressedfp == NULL)
+		{
+			free(fp);
+			fp = NULL;
+		}
+#else
+		die_horribly(NULL, modulename, "not built with zlib support\n");
+#endif
+	}
+	else
+	{
+#ifdef HAVE_LIBZ
+		fp->compressedfp = NULL;
+#endif
+		fp->uncompressedfp = fopen(path, mode);
+		if (fp->uncompressedfp == NULL)
+		{
+			free(fp);
+			fp = NULL;
+		}
+	}
+
+	return fp;
+}
+
+
+int
+cfread(void *ptr, int size, cfp *fp)
+{
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+		return gzread(fp->compressedfp, ptr, size);
+	else
+#endif
+		return fread(ptr, 1, size, fp->uncompressedfp);
+}
+
+int
+cfwrite(const void *ptr, int size, cfp *fp)
+{
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+		return gzwrite(fp->compressedfp, ptr, size);
+	else
+#endif
+		return fwrite(ptr, 1, size, fp->uncompressedfp);
+}
+
+int
+cfgetc(cfp *fp)
+{
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+		return gzgetc(fp->compressedfp);
+	else
+#endif
+		return fgetc(fp->uncompressedfp);
+}
+
+char *
+cfgets(cfp *fp, char *buf, int len)
+{
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+		return gzgets(fp->compressedfp, buf, len);
+	else
+#endif
+		return fgets(buf, len, fp->uncompressedfp);
+}
+
+int
+cfclose(cfp *fp)
+{
+	int result;
+
+	if (fp == NULL)
+	{
+		errno = EBADF;
+		return EOF;
+	}
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+	{
+		result = gzclose(fp->compressedfp);
+		fp->compressedfp = NULL;
+	}
+	else
+#endif
+	{
+		result = fclose(fp->uncompressedfp);
+		fp->uncompressedfp = NULL;
+	}
+	free(fp);
+
+	return result;
+}
+
+int
+cfeof(cfp *fp)
+{
+#ifdef HAVE_LIBZ
+	if (fp->compressedfp)
+		return gzeof(fp->compressedfp);
+	else
+#endif
+		return feof(fp->uncompressedfp);
+}
+
+#ifdef HAVE_LIBZ
+static int
+hasSuffix(const char *filename, const char *suffix)
+{
+	int filenamelen = strlen(filename);
+	int suffixlen = strlen(suffix);
+
+	if (filenamelen < suffixlen)
+		return 0;
+
+	return memcmp(&filename[filenamelen - suffixlen],
+					suffix,
+					suffixlen) == 0;
+}
+#endif
diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h
index 13e536f..934c04a 100644
--- a/src/bin/pg_dump/compress_io.h
+++ b/src/bin/pg_dump/compress_io.h
@@ -54,4 +54,18 @@ extern size_t WriteDataToArchive(ArchiveHandle *AH, CompressorState *cs,
 								 const void *data, size_t dLen);
 extern void EndCompressor(ArchiveHandle *AH, CompressorState *cs);
 
+
+typedef struct cfp cfp;
+
+extern FILE *cfgetfp(cfp *fp);
+extern cfp *cfopen(const char *path, const char *mode, int compression);
+extern cfp *cfopen_read(const char *path, const char *mode);
+extern cfp *cfopen_write(const char *path, const char *mode, int compression);
+extern int cfread(void *ptr, int size, cfp *fp);
+extern int cfwrite(const void *ptr, int size, cfp *fp);
+extern int cfgetc(cfp *fp);
+extern char *cfgets(cfp *fp, char *buf, int len);
+extern int cfclose(cfp *fp);
+extern int cfeof(cfp *fp);
+
 #endif
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 8fa9a57..b5803bd 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -48,9 +48,10 @@ typedef enum _archiveFormat
 {
 	archUnknown = 0,
 	archCustom = 1,
-	archFiles = 2,
-	archTar = 3,
-	archNull = 4
+	archDirectory = 2,
+	archFiles = 3,
+	archTar = 4,
+	archNull = 5
 } ArchiveFormat;
 
 typedef enum _archiveMode
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 64d8d93..bacfc4e 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -25,6 +25,7 @@
 
 #include <ctype.h>
 #include <unistd.h>
+#include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 
@@ -1722,6 +1723,8 @@ _discoverArchiveFormat(ArchiveHandle *AH)
 	char		sig[6];			/* More than enough */
 	size_t		cnt;
 	int			wantClose = 0;
+	char		buf[MAXPGPATH];
+	struct stat	st;
 
 #if 0
 	write_msg(modulename, "attempting to ascertain archive format\n");
@@ -1738,10 +1741,37 @@ _discoverArchiveFormat(ArchiveHandle *AH)
 	if (AH->fSpec)
 	{
 		wantClose = 1;
-		fh = fopen(AH->fSpec, PG_BINARY_R);
-		if (!fh)
-			die_horribly(AH, modulename, "could not open input file \"%s\": %s\n",
-						 AH->fSpec, strerror(errno));
+		/*
+		 * Check if the specified archive is a directory. If so, we open its
+		 * TOC file.
+		 */
+		buf[0] = '\0';
+		if (stat(AH->fSpec, &st) == 0 && S_ISDIR(st.st_mode))
+		{
+			if (snprintf(buf, MAXPGPATH, "%s/%s", AH->fSpec, "toc.dat") >= MAXPGPATH)
+				die_horribly(AH, modulename, "directory name too long: \"%s\"\n",
+							 AH->fSpec);
+			fh = fopen(buf, PG_BINARY_R);
+			if (!fh)
+			{
+#ifdef HAVE_LIBZ
+				/* the archive format accepts a gzipped toc.dat as well */
+				if (snprintf(buf, MAXPGPATH, "%s/%s", AH->fSpec, "toc.dat.gz") >= MAXPGPATH)
+					die_horribly(AH, modulename, "directory name too long: \"%s\"\n",
+								 AH->fSpec);
+				fh = fopen(buf, PG_BINARY_R);
+#endif
+				if (!fh)
+					die_horribly(AH, modulename, "input directory does not appear to be a valid archive\n");
+			}
+		}
+		else
+		{
+			fh = fopen(AH->fSpec, PG_BINARY_R);
+			if (!fh)
+				die_horribly(AH, modulename, "could not open input file \"%s\": %s\n",
+							 AH->fSpec, strerror(errno));
+		}
 	}
 	else
 	{
@@ -1951,6 +1981,10 @@ _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 			InitArchiveFmt_Custom(AH);
 			break;
 
+		case archDirectory:
+			InitArchiveFmt_Directory(AH);
+			break;
+
 		case archFiles:
 			InitArchiveFmt_Files(AH);
 			break;
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index d9378df..4b6d9f1 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -374,6 +374,7 @@ extern void EndRestoreBlob(ArchiveHandle *AH, Oid oid);
 extern void EndRestoreBlobs(ArchiveHandle *AH);
 
 extern void InitArchiveFmt_Custom(ArchiveHandle *AH);
+extern void InitArchiveFmt_Directory(ArchiveHandle *AH);
 extern void InitArchiveFmt_Files(ArchiveHandle *AH);
 extern void InitArchiveFmt_Null(ArchiveHandle *AH);
 extern void InitArchiveFmt_Tar(ArchiveHandle *AH);
diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c
new file mode 100644
index 0000000..9565411
--- /dev/null
+++ b/src/bin/pg_dump/pg_backup_directory.c
@@ -0,0 +1,672 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_backup_directory.c
+ *
+ *  A directory format dump is a directory, which contains a "toc.dat" file
+ *  for the TOC, and a separate file for each data entry, named "<oid>.dat".
+ *  Large objects (BLOBs) are stored in separate files named "blob_<uid>.dat",
+ *  and there's a plain-text TOC file for them called "blobs.toc". If
+ *  compression is used, each data file is individually compressed and the
+ *  ".gz" suffix is added to the filenames. The TOC files, however, are not
+ *  compressed.
+ *
+ *  NOTE: This format is identical to the files written in the tar file in
+ *  the 'tar' format, except that we don't write the restore.sql file, and
+ *  the tar format doesn't support compression. Please keep the formats in
+ *  sync.
+ *
+ *
+ *  Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ *  Portions Copyright (c) 1994, Regents of the University of California
+ *  Portions Copyright (c) 2000, Philip Warner
+ *
+ *  Rights are granted to use this software in any way so long
+ *  as this notice is not removed.
+ *
+ *  The author is not responsible for loss or damages that may
+ *  result from it's use.
+ *
+ * IDENTIFICATION
+ *     src/bin/pg_dump/pg_backup_directory.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <dirent.h>
+#include <sys/stat.h>
+
+#include "pg_backup_archiver.h"
+#include "compress_io.h"
+
+typedef struct
+{
+	/*
+	 * Our archive location. This is basically what the user specified as his
+	 * backup file but of course here it is a directory.
+	 */
+	char			   *directory;
+
+	cfp				   *dataFH;				/* currently open data file */
+
+	cfp				   *blobsTocFH;			/* file handle for blobs.toc */
+} lclContext;
+
+typedef struct
+{
+	char	   *filename;		/* filename excluding the directory (basename) */
+} lclTocEntry;
+
+static const char *modulename = gettext_noop("directory archiver");
+
+/* prototypes for private functions */
+static void _ArchiveEntry(ArchiveHandle *AH, TocEntry *te);
+static void _StartData(ArchiveHandle *AH, TocEntry *te);
+static void _EndData(ArchiveHandle *AH, TocEntry *te);
+static size_t _WriteData(ArchiveHandle *AH, const void *data, size_t dLen);
+static int	_WriteByte(ArchiveHandle *AH, const int i);
+static int	_ReadByte(ArchiveHandle *);
+static size_t _WriteBuf(ArchiveHandle *AH, const void *buf, size_t len);
+static size_t _ReadBuf(ArchiveHandle *AH, void *buf, size_t len);
+static void _CloseArchive(ArchiveHandle *AH);
+static void _PrintTocData(ArchiveHandle *AH, TocEntry *te, RestoreOptions *ropt);
+
+static void _WriteExtraToc(ArchiveHandle *AH, TocEntry *te);
+static void _ReadExtraToc(ArchiveHandle *AH, TocEntry *te);
+static void _PrintExtraToc(ArchiveHandle *AH, TocEntry *te);
+
+static void _StartBlobs(ArchiveHandle *AH, TocEntry *te);
+static void _StartBlob(ArchiveHandle *AH, TocEntry *te, Oid oid);
+static void _EndBlob(ArchiveHandle *AH, TocEntry *te, Oid oid);
+static void _EndBlobs(ArchiveHandle *AH, TocEntry *te);
+static void _LoadBlobs(ArchiveHandle *AH, RestoreOptions *ropt);
+
+static char *prependDirectory(ArchiveHandle *AH, const char *relativeFilename);
+
+static void createDirectory(const char *dir);
+
+
+/*
+ *	Init routine required by ALL formats. This is a global routine
+ *	and should be declared in pg_backup_archiver.h
+ *
+ *	Its task is to create any extra archive context (using AH->formatData),
+ *	and to initialize the supported function pointers.
+ *
+ *	It should also prepare whatever its input source is for reading/writing,
+ *	and in the case of a read mode connection, it should load the Header & TOC.
+ */
+void
+InitArchiveFmt_Directory(ArchiveHandle *AH)
+{
+	lclContext *ctx;
+
+	/* Assuming static functions, this can be copied for each format. */
+	AH->ArchiveEntryPtr = _ArchiveEntry;
+	AH->StartDataPtr = _StartData;
+	AH->WriteDataPtr = _WriteData;
+	AH->EndDataPtr = _EndData;
+	AH->WriteBytePtr = _WriteByte;
+	AH->ReadBytePtr = _ReadByte;
+	AH->WriteBufPtr = _WriteBuf;
+	AH->ReadBufPtr = _ReadBuf;
+	AH->ClosePtr = _CloseArchive;
+	AH->ReopenPtr = NULL;
+	AH->PrintTocDataPtr = _PrintTocData;
+	AH->ReadExtraTocPtr = _ReadExtraToc;
+	AH->WriteExtraTocPtr = _WriteExtraToc;
+	AH->PrintExtraTocPtr = _PrintExtraToc;
+
+	AH->StartBlobsPtr = _StartBlobs;
+	AH->StartBlobPtr = _StartBlob;
+	AH->EndBlobPtr = _EndBlob;
+	AH->EndBlobsPtr = _EndBlobs;
+
+	AH->ClonePtr = NULL;
+	AH->DeClonePtr = NULL;
+
+	/*
+	 * Set up some special context used in compressing data.
+	 */
+	ctx = (lclContext *) calloc(1, sizeof(lclContext));
+	if (ctx == NULL)
+		die_horribly(AH, modulename, "out of memory\n");
+	AH->formatData = (void *) ctx;
+
+	ctx->dataFH = NULL;
+	ctx->blobsTocFH = NULL;
+
+	/* Initialize LO buffering */
+	AH->lo_buf_size = LOBBUFSIZE;
+	AH->lo_buf = (void *) malloc(LOBBUFSIZE);
+	if (AH->lo_buf == NULL)
+		die_horribly(AH, modulename, "out of memory\n");
+
+	/*
+	 * Now open the TOC file
+	 */
+
+	if (!AH->fSpec || strcmp(AH->fSpec, "") == 0)
+		die_horribly(AH, modulename, "no directory specified\n");
+
+	ctx->directory = AH->fSpec;
+
+	if (AH->mode == archModeWrite)
+	{
+		/* Create the directory, errors are caught there */
+		createDirectory(ctx->directory);
+	}
+	else
+	{							/* Read Mode */
+		char	   *fname;
+		cfp		   *tocFH;
+
+		fname = prependDirectory(AH, "toc.dat");
+
+		tocFH = cfopen_read(fname, PG_BINARY_R);
+		if (tocFH == NULL)
+			die_horribly(AH, modulename,
+						 "could not open input file \"%s\": %s\n",
+						 fname, strerror(errno));
+
+		ctx->dataFH = tocFH;
+		ReadHead(AH);
+		ReadToc(AH);
+
+		/* Nothing else in the file, so close it again... */
+		if (cfclose(tocFH) != 0)
+			die_horribly(AH, modulename, "could not close TOC file: %s\n",
+						 strerror(errno));
+		ctx->dataFH = NULL;
+	}
+}
+
+/*
+ * Called by the Archiver when the dumper creates a new TOC entry.
+ *
+ * We determine the filename for this entry.
+*/
+static void
+_ArchiveEntry(ArchiveHandle *AH, TocEntry *te)
+{
+	lclTocEntry	   *tctx;
+	char			fn[MAXPGPATH];
+
+	tctx = (lclTocEntry *) calloc(1, sizeof(lclTocEntry));
+	if (!tctx)
+		die_horribly(AH, modulename, "out of memory\n");
+	if (te->dataDumper)
+	{
+		snprintf(fn, MAXPGPATH, "%d.dat", te->dumpId);
+		tctx->filename = strdup(fn);
+	}
+	else if (strcmp(te->desc, "BLOBS") == 0)
+		tctx->filename = strdup("blobs.toc");
+	else
+		tctx->filename = NULL;
+
+	te->formatData = (void *) tctx;
+}
+
+/*
+ * Called by the Archiver to save any extra format-related TOC entry
+ * data.
+ *
+ * Use the Archiver routines to write data - they are non-endian, and
+ * maintain other important file information.
+ */
+static void
+_WriteExtraToc(ArchiveHandle *AH, TocEntry *te)
+{
+	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
+
+	/*
+	 * A dumpable object has set tctx->filename, any other object has not.
+	 * (see _ArchiveEntry).
+	 */
+	if (tctx->filename)
+		WriteStr(AH, tctx->filename);
+	else
+		WriteStr(AH, "");
+}
+
+/*
+ * Called by the Archiver to read any extra format-related TOC data.
+ *
+ * Needs to match the order defined in _WriteExtraToc, and should also
+ * use the Archiver input routines.
+ */
+static void
+_ReadExtraToc(ArchiveHandle *AH, TocEntry *te)
+{
+	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
+
+	if (tctx == NULL)
+	{
+		tctx = (lclTocEntry *) calloc(1, sizeof(lclTocEntry));
+		if (!tctx)
+			die_horribly(AH, modulename, "out of memory\n");
+		te->formatData = (void *) tctx;
+	}
+
+	tctx->filename = ReadStr(AH);
+	if (strlen(tctx->filename) == 0)
+	{
+		free(tctx->filename);
+		tctx->filename = NULL;
+	}
+}
+
+/*
+ * Called by the Archiver when restoring an archive to output a comment
+ * that includes useful information about the TOC entry.
+ */
+static void
+_PrintExtraToc(ArchiveHandle *AH, TocEntry *te)
+{
+	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
+
+	if (AH->public.verbose && tctx->filename)
+		ahprintf(AH, "-- File: %s\n", tctx->filename);
+}
+
+/*
+ * Called by the archiver when saving TABLE DATA (not schema). This routine
+ * should save whatever format-specific information is needed to read
+ * the archive back.
+ *
+ * It is called just prior to the dumper's 'DataDumper' routine being called.
+ *
+ * We create the data file for writing.
+ */
+static void
+_StartData(ArchiveHandle *AH, TocEntry *te)
+{
+	lclTocEntry	   *tctx = (lclTocEntry *) te->formatData;
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	char		   *fname;
+
+	fname = prependDirectory(AH, tctx->filename);
+
+	ctx->dataFH = cfopen_write(fname, PG_BINARY_W, AH->compression);
+	if (ctx->dataFH == NULL)
+		die_horribly(AH, modulename, "could not open output file \"%s\": %s\n",
+					 fname, strerror(errno));
+}
+
+/*
+ * Called by archiver when dumper calls WriteData. This routine is
+ * called for both BLOB and TABLE data; it is the responsibility of
+ * the format to manage each kind of data using StartBlob/StartData.
+ *
+ * It should only be called from within a DataDumper routine.
+ *
+ * We write the data to the open data file.
+ */
+static size_t
+_WriteData(ArchiveHandle *AH, const void *data, size_t dLen)
+{
+	lclContext		   *ctx = (lclContext *) AH->formatData;
+
+	if (dLen == 0)
+		return 0;
+
+	return cfwrite(data, dLen, ctx->dataFH);
+}
+
+/*
+ * Called by the archiver when a dumper's 'DataDumper' routine has
+ * finished.
+ *
+ * We close the data file.
+ */
+static void
+_EndData(ArchiveHandle *AH, TocEntry *te)
+{
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+
+	/* Close the file */
+	cfclose(ctx->dataFH);
+
+	ctx->dataFH = NULL;
+}
+
+/*
+ * Print data for a given file (can be a BLOB as well)
+ */
+static void
+_PrintFileData(ArchiveHandle *AH, char *filename, RestoreOptions *ropt)
+{
+	size_t		cnt;
+	char	   *buf;
+	size_t		buflen;
+	cfp		   *cfp;
+
+	if (!filename)
+		return;
+
+	cfp  = cfopen_read(filename, PG_BINARY_R);
+	if (!cfp)
+		die_horribly(AH, modulename, "could not open input file \"%s\": %s\n",
+					 filename, strerror(errno));
+
+	buf = malloc(ZLIB_OUT_SIZE);
+	if (buf == NULL)
+		die_horribly(NULL, modulename, "out of memory\n");
+	buflen = ZLIB_OUT_SIZE;
+
+	while ((cnt = cfread(buf, buflen, cfp)))
+		ahwrite(buf, 1, cnt, AH);
+
+	free(buf);
+}
+
+/*
+ * Print data for a given TOC entry
+*/
+static void
+_PrintTocData(ArchiveHandle *AH, TocEntry *te, RestoreOptions *ropt)
+{
+	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
+
+	if (!tctx->filename)
+		return;
+
+	if (strcmp(te->desc, "BLOBS") == 0)
+		_LoadBlobs(AH, ropt);
+	else
+	{
+		char   *fname = prependDirectory(AH, tctx->filename);
+		_PrintFileData(AH, fname, ropt);
+	}
+}
+
+static void
+_LoadBlobs(ArchiveHandle *AH, RestoreOptions *ropt)
+{
+	Oid				oid;
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	char		   *fname;
+	char			line[MAXPGPATH];
+
+	StartRestoreBlobs(AH);
+
+	fname = prependDirectory(AH, "blobs.toc");
+
+	ctx->blobsTocFH = cfopen_read(fname, PG_BINARY_R);
+
+	if (ctx->blobsTocFH == NULL)
+		die_horribly(AH, modulename, "could not open large object TOC file \"%s\" for input: %s\n",
+					 fname, strerror(errno));
+
+	/* we cannot test for feof() since EOF only shows up in the low
+	 * level read functions. But they would die_horribly() anyway. */
+	while ((cfgets(ctx->blobsTocFH, line, MAXPGPATH)) != NULL)
+	{
+		char		fname[MAXPGPATH];
+		char		path[MAXPGPATH];
+
+		if (sscanf(line, "%u %s\n", &oid, fname) != 2)
+			die_horribly(AH, modulename, "invalid line in large object TOC file: %s\n",
+						 line);
+
+		StartRestoreBlob(AH, oid, ropt->dropSchema);
+		snprintf(path, MAXPGPATH, "%s/%s", ctx->directory, fname);
+		_PrintFileData(AH, path, ropt);
+		EndRestoreBlob(AH, oid);
+	}
+	if (!cfeof(ctx->blobsTocFH))
+		die_horribly(AH, modulename, "error reading large object TOC file \"%s\"\n",
+					 fname);
+
+	if (cfclose(ctx->blobsTocFH) != 0)
+		die_horribly(AH, modulename, "could not close large object TOC file \"%s\": %s\n",
+					 fname, strerror(errno));
+
+	ctx->blobsTocFH = NULL;
+
+	EndRestoreBlobs(AH);
+}
+
+
+/*
+ * Write a byte of data to the archive.
+ * Called by the archiver to do integer & byte output to the archive.
+ * These routines are only used to read & write the headers & TOC.
+ */
+static int
+_WriteByte(ArchiveHandle *AH, const int i)
+{
+	unsigned char c = (unsigned char) i;
+	lclContext *ctx = (lclContext *) AH->formatData;
+
+	if (cfwrite(&c, 1, ctx->dataFH) != 1)
+		die_horribly(AH, modulename, "could not write byte\n");
+
+	return 1;
+}
+
+/*
+ * Read a byte of data from the archive.
+ * Called by the archiver to read bytes & integers from the archive.
+ * These routines are only used to read & write headers & TOC.
+ * EOF should be treated as a fatal error.
+ */
+static int
+_ReadByte(ArchiveHandle *AH)
+{
+	lclContext *ctx = (lclContext *) AH->formatData;
+	int			res;
+
+	res = cfgetc(ctx->dataFH);
+	if (res == EOF)
+		die_horribly(AH, modulename, "unexpected end of file\n");
+
+	return res;
+}
+
+/*
+ * Write a buffer of data to the archive.
+ * Called by the archiver to write a block of bytes to the TOC or a data file.
+ */
+static size_t
+_WriteBuf(ArchiveHandle *AH, const void *buf, size_t len)
+{
+	lclContext *ctx = (lclContext *) AH->formatData;
+	size_t		res;
+
+	res = cfwrite(buf, len, ctx->dataFH);
+	if (res != len)
+		die_horribly(AH, modulename, "could not write to output file: %s\n",
+					 strerror(errno));
+
+	return res;
+}
+
+/*
+ * Read a block of bytes from the archive.
+ *
+ * Mandatory.
+ *
+ * Called by the archiver to read a block of bytes from the archive
+ *
+ */
+static size_t
+_ReadBuf(ArchiveHandle *AH, void *buf, size_t len)
+{
+	lclContext *ctx = (lclContext *) AH->formatData;
+	size_t		res;
+
+	res = cfread(buf, len, ctx->dataFH);
+
+	return res;
+}
+
+/*
+ * Close the archive.
+ *
+ * Mandatory.
+ *
+ * When writing the archive, this is the routine that actually starts
+ * the process of saving it to files. No data should be written prior
+ * to this point, since the user could sort the TOC after creating it.
+ *
+ * If an archive is to be written, this routine must call:
+ *		WriteHead			to save the archive header
+ *		WriteToc			to save the TOC entries
+ *		WriteDataChunks		to save all DATA & BLOBs.
+ */
+static void
+_CloseArchive(ArchiveHandle *AH)
+{
+	lclContext *ctx = (lclContext *) AH->formatData;
+	if (AH->mode == archModeWrite)
+	{
+		cfp	   *tocFH;
+		char   *fname = prependDirectory(AH, "toc.dat");
+
+		/* The TOC is always created uncompressed */
+		tocFH = cfopen_write(fname, PG_BINARY_W, 0);
+		if (tocFH == NULL)
+			die_horribly(AH, modulename, "could not open output file \"%s\": %s\n",
+						 fname, strerror(errno));
+		ctx->dataFH = tocFH;
+		WriteHead(AH);
+		WriteToc(AH);
+		if (cfclose(tocFH) != 0)
+			die_horribly(AH, modulename, "could not close TOC file: %s\n",
+						 strerror(errno));
+		WriteDataChunks(AH);
+	}
+	AH->FH = NULL;
+}
+
+
+/*
+ * BLOB support
+ */
+
+/*
+ * Called by the archiver when starting to save all BLOB DATA (not schema).
+ * It is called just prior to the dumper's DataDumper routine.
+ *
+ * We open the large object TOC file here, so that we can append a line to 
+ * it for each blob.
+ */
+static void
+_StartBlobs(ArchiveHandle *AH, TocEntry *te)
+{
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	char		   *fname;
+
+	fname = prependDirectory(AH, "blobs.toc");
+
+	/* The blob TOC file is never compressed */
+	ctx->blobsTocFH = cfopen_write(fname, "ab", 0);
+	if (ctx->blobsTocFH == NULL)
+		die_horribly(AH, modulename, "could not open output file \"%s\": %s\n",
+					 fname, strerror(errno));
+}
+
+/*
+ * Called by the archiver when we're about to start dumping a blob.
+ *
+ * We create a file to write the blob to.
+ */
+static void
+_StartBlob(ArchiveHandle *AH, TocEntry *te, Oid oid)
+{
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	char			fname[MAXPGPATH];
+
+	snprintf(fname, MAXPGPATH, "%s/blob_%u.dat", ctx->directory, oid);
+
+	ctx->dataFH = cfopen_write(fname, PG_BINARY_W, AH->compression);
+
+	if (ctx->dataFH == NULL)
+		die_horribly(AH, modulename, "could not open output file \"%s\": %s\n",
+					 fname, strerror(errno));
+}
+
+/*
+ * Called by the archiver when the dumper is finished writing a blob.
+ *
+ * We close the blob file and write an entry to the blob TOC file for it.
+ */
+static void
+_EndBlob(ArchiveHandle *AH, TocEntry *te, Oid oid)
+{
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	char			buf[50];
+	int				len;
+
+	/* Close the BLOB data file itself */
+	cfclose(ctx->dataFH);
+	ctx->dataFH = NULL;
+
+	/* register the blob in blobs.toc */
+	len = snprintf(buf, sizeof(buf), "%u blob_%u.dat\n", oid, oid);
+	if (cfwrite(buf, len, ctx->blobsTocFH) != len)
+		die_horribly(AH, modulename, "could not write to blobs TOC file\n");		
+}
+
+/*
+ * Called by the archiver when finishing saving all BLOB DATA.
+ *
+ * We close the blobs TOC file.
+ */
+static void
+_EndBlobs(ArchiveHandle *AH, TocEntry *te)
+{
+	lclContext *ctx = (lclContext *) AH->formatData;
+
+	cfclose(ctx->blobsTocFH);
+	ctx->blobsTocFH = NULL;
+}
+
+static void
+createDirectory(const char *dir)
+{
+	struct stat		st;
+
+	/* the directory must not exist yet. */
+	if (stat(dir, &st) == 0)
+	{
+		if (S_ISDIR(st.st_mode))
+			die_horribly(NULL, modulename,
+						 "cannot create directory %s, it exists already\n",
+						 dir);
+		else
+			die_horribly(NULL, modulename,
+						 "cannot create directory %s, a file with this name "
+						 "exists already\n", dir);
+	}
+
+	/*
+	 * Now we create the directory. Note that for some race condition we could
+	 * also run into the situation that the directory has been created just
+	 * between our two calls.
+	 */
+	if (mkdir(dir, 0700) < 0)
+		die_horribly(NULL, modulename, "could not create directory %s: %s",
+					 dir, strerror(errno));
+}
+
+
+static char *
+prependDirectory(ArchiveHandle *AH, const char *relativeFilename)
+{
+	lclContext	   *ctx = (lclContext *) AH->formatData;
+	static char		buf[MAXPGPATH];
+	char		   *dname;
+
+	dname = ctx->directory;
+
+	if (strlen(dname) + 1 + strlen(relativeFilename) + 1 > MAXPGPATH)
+			die_horribly(AH, modulename, "path name too long: %s", dname);
+
+	strcpy(buf, dname);
+	strcat(buf, "/");
+	strcat(buf, relativeFilename);
+
+	return buf;
+}
diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index 006f7da..e12552a 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -4,6 +4,10 @@
  *
  *	This file is copied from the 'files' format file, but dumps data into
  *	one temp file then sends it to the output TAR archive.
+ * 
+ *  NOTE: If you untar the created 'tar' file, the resulting files are
+ *  compatible with the 'directory' format. Please keep the two formats in
+ *  sync.
  *
  *	See the headers to pg_backup_files & pg_restore for more details.
  *
@@ -167,7 +171,7 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 		die_horribly(AH, modulename, "out of memory\n");
 
 	/*
-	 * Now open the TOC file
+	 * Now open the tar file, and load the TOC if we're in read mode.
 	 */
 	if (AH->mode == archModeWrite)
 	{
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 40b414b..bcac622 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -138,6 +138,7 @@ static int	no_unlogged_table_data = 0;
 
 
 static void help(const char *progname);
+static ArchiveFormat parseArchiveFormat(const char *format, ArchiveMode *mode);
 static void expand_schema_name_patterns(SimpleStringList *patterns,
 							SimpleOidList *oids);
 static void expand_table_name_patterns(SimpleStringList *patterns,
@@ -267,6 +268,8 @@ main(int argc, char **argv)
 	int			my_version;
 	int			optindex;
 	RestoreOptions *ropt;
+	ArchiveFormat archiveFormat = archUnknown;
+	ArchiveMode	archiveMode;
 
 	static int	disable_triggers = 0;
 	static int	outputNoTablespaces = 0;
@@ -539,36 +542,31 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
-	/* open the output file */
-	if (pg_strcasecmp(format, "a") == 0 || pg_strcasecmp(format, "append") == 0)
-	{
-		/* This is used by pg_dumpall, and is not documented */
-		plainText = 1;
-		g_fout = CreateArchive(filename, archNull, 0, archModeAppend);
-	}
-	else if (pg_strcasecmp(format, "c") == 0 || pg_strcasecmp(format, "custom") == 0)
-		g_fout = CreateArchive(filename, archCustom, compressLevel, archModeWrite);
-	else if (pg_strcasecmp(format, "f") == 0 || pg_strcasecmp(format, "file") == 0)
-	{
-		/*
-		 * Dump files into the current directory; for demonstration only, not
-		 * documented.
-		 */
-		g_fout = CreateArchive(filename, archFiles, compressLevel, archModeWrite);
-	}
-	else if (pg_strcasecmp(format, "p") == 0 || pg_strcasecmp(format, "plain") == 0)
-	{
+	archiveFormat = parseArchiveFormat(format, &archiveMode);
+
+	/* archiveFormat specific setup */
+	if (archiveFormat == archNull)
 		plainText = 1;
-		g_fout = CreateArchive(filename, archNull, 0, archModeWrite);
-	}
-	else if (pg_strcasecmp(format, "t") == 0 || pg_strcasecmp(format, "tar") == 0)
-		g_fout = CreateArchive(filename, archTar, compressLevel, archModeWrite);
-	else
+
+	/*
+	 * Ignore compression level for plain format. XXX: This is a bit
+	 * inconsistent, tar-format throws an error instead.
+	 */
+	if (archiveFormat == archNull)
+		compressLevel = 0;
+
+	/* Custom and directory formats are compressed by default */
+	if (compressLevel == -1)
 	{
-		write_msg(NULL, "invalid output format \"%s\" specified\n", format);
-		exit(1);
+		if (archiveFormat == archCustom || archiveFormat == archDirectory)
+			compressLevel = Z_DEFAULT_COMPRESSION;
+		else
+			compressLevel = 0;
 	}
 
+	/* open the output file */
+	g_fout = CreateArchive(filename, archiveFormat, compressLevel, archiveMode);
+
 	if (g_fout == NULL)
 	{
 		write_msg(NULL, "could not open output file \"%s\" for writing\n", filename);
@@ -835,8 +833,8 @@ help(const char *progname)
 	printf(_("  %s [OPTION]... [DBNAME]\n"), progname);
 
 	printf(_("\nGeneral options:\n"));
-	printf(_("  -f, --file=FILENAME         output file name\n"));
-	printf(_("  -F, --format=c|t|p          output file format (custom, tar, plain text)\n"));
+	printf(_("  -f, --file=FILENAME         output file or directory name\n"));
+	printf(_("  -F, --format=c|d|t|p        output file format (custom, directory, tar, plain text)\n"));
 	printf(_("  -v, --verbose               verbose mode\n"));
 	printf(_("  -Z, --compress=0-9          compression level for compressed formats\n"));
 	printf(_("  --lock-wait-timeout=TIMEOUT fail after waiting TIMEOUT for a table lock\n"));
@@ -894,6 +892,49 @@ exit_nicely(void)
 	exit(1);
 }
 
+static ArchiveFormat
+parseArchiveFormat(const char *format, ArchiveMode *mode)
+{
+	ArchiveFormat archiveFormat;
+
+	*mode = archModeWrite;
+
+	if (pg_strcasecmp(format, "a") == 0 || pg_strcasecmp(format, "append") == 0)
+	{
+		/* This is used by pg_dumpall, and is not documented */
+		archiveFormat = archNull;
+		*mode = archModeAppend;
+	}
+	else if (pg_strcasecmp(format, "c") == 0)
+		archiveFormat = archCustom;
+	else if (pg_strcasecmp(format, "custom") == 0)
+		archiveFormat = archCustom;
+	else if (pg_strcasecmp(format, "d") == 0)
+		archiveFormat = archDirectory;
+	else if (pg_strcasecmp(format, "directory") == 0)
+		archiveFormat = archDirectory;
+	else if (pg_strcasecmp(format, "f") == 0 || pg_strcasecmp(format, "file") == 0)
+		/*
+		 * Dump files into the current directory; for demonstration only, not
+		 * documented.
+		 */
+		archiveFormat = archFiles;
+	else if (pg_strcasecmp(format, "p") == 0)
+		archiveFormat = archNull;
+	else if (pg_strcasecmp(format, "plain") == 0)
+		archiveFormat = archNull;
+	else if (pg_strcasecmp(format, "t") == 0)
+		archiveFormat = archTar;
+	else if (pg_strcasecmp(format, "tar") == 0)
+		archiveFormat = archTar;
+	else
+	{
+		write_msg(NULL, "invalid output format \"%s\" specified\n", format);
+		exit(1);
+	}
+	return archiveFormat;
+}
+
 /*
  * Find the OIDs of all schemas matching the given list of patterns,
  * and append them to the given OID list.
@@ -2187,7 +2228,9 @@ dumpBlobs(Archive *AH, void *arg)
 					exit_nicely();
 				}
 
-				WriteData(AH, buf, cnt);
+				/* we try to avoid writing empty chunks */
+				if (cnt > 0)
+					WriteData(AH, buf, cnt);
 			} while (cnt > 0);
 
 			lo_close(g_conn, loFd);
diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index 1ddba72..37793ad 100644
--- a/src/bin/pg_dump/pg_restore.c
+++ b/src/bin/pg_dump/pg_restore.c
@@ -352,6 +352,11 @@ main(int argc, char **argv)
 				opts->format = archCustom;
 				break;
 
+			case 'd':
+			case 'D':
+				opts->format = archDirectory;
+				break;
+
 			case 'f':
 			case 'F':
 				opts->format = archFiles;
@@ -363,7 +368,7 @@ main(int argc, char **argv)
 				break;
 
 			default:
-				write_msg(NULL, "unrecognized archive format \"%s\"; please specify \"c\" or \"t\"\n",
+				write_msg(NULL, "unrecognized archive format \"%s\"; please specify \"c\", \"d\" or \"t\"\n",
 						  opts->formatName);
 				exit(1);
 		}
@@ -418,7 +423,7 @@ usage(const char *progname)
 	printf(_("\nGeneral options:\n"));
 	printf(_("  -d, --dbname=NAME        connect to database name\n"));
 	printf(_("  -f, --file=FILENAME      output file name\n"));
-	printf(_("  -F, --format=c|t         backup file format (should be automatic)\n"));
+	printf(_("  -F, --format=c|d|t       backup file format (should be automatic)\n"));
 	printf(_("  -l, --list               print summarized TOC of the archive\n"));
 	printf(_("  -v, --verbose            verbose mode\n"));
 	printf(_("  --help                   show this help, then exit\n"));
#11Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: pg_dump directory archive format / parallel pg_dump

On Fri, Jan 21, 2011 at 4:41 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

There's one UI thing that bothers me. The option to specify the target
directory is called --file. But it's clearly not a file. OTOH, I'd hate to
introduce a parallel --dir option just for this. Any thoughts on this?

If we were starting over, I'd probably suggest calling the option -o,
--output. But since -o is already taken (for --oids) I'd be inclined
to just make the help text read:

-f, --file=FILENAME output file (or directory) name
-F, --format=c|t|p|d output file format (custom, tar, text, dir)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#11)
Re: pg_dump directory archive format / parallel pg_dump

On 21.01.2011 15:35, Robert Haas wrote:

On Fri, Jan 21, 2011 at 4:41 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

There's one UI thing that bothers me. The option to specify the target
directory is called --file. But it's clearly not a file. OTOH, I'd hate to
introduce a parallel --dir option just for this. Any thoughts on this?

If we were starting over, I'd probably suggest calling the option -o,
--output. But since -o is already taken (for --oids) I'd be inclined
to just make the help text read:

-f, --file=FILENAME output file (or directory) name
-F, --format=c|t|p|d output file format (custom, tar, text, dir)

Ok, that's exactly what the patch does now. I guess it's fine then.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Heikki Linnakangas (#12)
Re: pg_dump directory archive format / parallel pg_dump

On 01/21/2011 10:34 AM, Heikki Linnakangas wrote:

On 21.01.2011 15:35, Robert Haas wrote:

On Fri, Jan 21, 2011 at 4:41 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

There's one UI thing that bothers me. The option to specify the target
directory is called --file. But it's clearly not a file. OTOH, I'd
hate to
introduce a parallel --dir option just for this. Any thoughts on this?

If we were starting over, I'd probably suggest calling the option -o,
--output. But since -o is already taken (for --oids) I'd be inclined
to just make the help text read:

-f, --file=FILENAME output file (or directory) name
-F, --format=c|t|p|d output file format (custom, tar, text,
dir)

Ok, that's exactly what the patch does now. I guess it's fine then.

Maybe we could change the hint to say "--file=DESTINATION" or
"--file=FILENAME|DIRNAME" ?

Just a thought.

cheers

andrew

#14Euler Taveira de Oliveira
In reply to: Andrew Dunstan (#13)
Re: pg_dump directory archive format / parallel pg_dump

On 21-01-2011 12:47, Andrew Dunstan wrote:

Maybe we could change the hint to say "--file=DESTINATION" or
"--file=FILENAME|DIRNAME" ?

... "--file=OUTPUT" or "--file=OUTPUTNAME".

--
Euler Taveira de Oliveira
http://www.timbira.com/

#15Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Euler Taveira de Oliveira (#14)
Re: pg_dump directory archive format / parallel pg_dump

On 21.01.2011 19:11, Euler Taveira de Oliveira wrote:

On 21-01-2011 12:47, Andrew Dunstan wrote:

Maybe we could change the hint to say "--file=DESTINATION" or
"--file=FILENAME|DIRNAME" ?

... "--file=OUTPUT" or "--file=OUTPUTNAME".

Ok, works for me.

I've committed this patch now, with a whole bunch of further fixes.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16Robert Haas
robertmhaas@gmail.com
In reply to: Joachim Wieland (#3)
Re: pg_dump directory archive format / parallel pg_dump

On Wed, Jan 19, 2011 at 12:45 AM, Joachim Wieland <joe@mcknight.de> wrote:

On Mon, Jan 17, 2011 at 5:38 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

This one is the last version of this patch? if so, commitfest app
should be updated to reflect that

Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.

The parallel pg_dump portion of this patch (i.e. the still-uncommitted
part) no longer applies. Please rebase.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Joachim Wieland
joe@mcknight.de
In reply to: Robert Haas (#16)
1 attachment(s)
Re: pg_dump directory archive format / parallel pg_dump

On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The parallel pg_dump portion of this patch (i.e. the still-uncommitted
part) no longer applies.  Please rebase.

Here is a rebased version with some minor changes as well. I haven't
tested it on Windows now but will do so as soon as the Unix part has
been reviewed.

Joachim

Attachments:

parallel_pg_dump.patch.gzapplication/x-gzip; name=parallel_pg_dump.patch.gzDownload
#18Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Joachim Wieland (#17)
Re: pg_dump directory archive format / parallel pg_dump

On Wed, Feb 2, 2011 at 13:32, Joachim Wieland <joe@mcknight.de> wrote:

Here is a rebased version with some minor changes as well.

I read the patch works as below. Am I understanding correctly?
1. Open all connections in a parent process.
2. Start transactions for each connection in the parent.
3. Spawn child processes with fork().
4. Each child process uses one of the inherited connections.
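
In code form, roughly this (a simplified sketch of my reading above, not the
actual patch code; the connection string, the query and all error handling
are placeholders only):

#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include "libpq-fe.h"

#define NUM_WORKERS 3

int
main(void)
{
    PGconn *conns[NUM_WORKERS];
    int     i;

    /* (1) open all connections and (2) start a transaction on each */
    for (i = 0; i < NUM_WORKERS; i++)
    {
        conns[i] = PQconnectdb("dbname=test");  /* placeholder conninfo */
        PQclear(PQexec(conns[i],
                       "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE"));
    }

    /* (3) spawn one child per connection */
    for (i = 0; i < NUM_WORKERS; i++)
    {
        if (fork() == 0)
        {
            /*
             * (4) the child keeps using the connection it inherited; note
             * that it also inherits all the other connections, which is
             * part of the concern below.
             */
            PQclear(PQexec(conns[i], "SELECT 1"));  /* stands in for the dump work */
            exit(0);
        }
    }

    while (wait(NULL) > 0)
        ;
    return 0;
}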

I think we have 2 important technical issues here:
 * The consistency is not perfect. Each transaction is started
  with small delays in steps 1-2, but we cannot guarantee that no
  other transaction happens between them.
 * Can we inherit connections to child processes with fork()?
  Moreover, we also need to pass running transactions to children.
  I wonder whether libpq is designed for such usage.

To solve both issues, we might want a way to control visibility
in the database server instead of in the client programs. Don't we need
server-side support like [1] before developing parallel dump?
[1]: http://wiki.postgresql.org/wiki/ClusterFeatures#Export_snapshots_to_other_sessions

I haven't
tested it on Windows now but will do so as soon as the Unix part has
been reviewed.

It might be better to remove the Windows-specific code from the first try.
I doubt the Windows message queue is the best API for such a console-based
application. I hope we can use the same implementation on all
platforms for inter-process/thread communication.

--
Itagaki Takahiro

#19Joachim Wieland
joe@mcknight.de
In reply to: Itagaki Takahiro (#18)
Re: pg_dump directory archive format / parallel pg_dump

On Thu, Feb 3, 2011 at 11:46 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

I think we have 2 important technical issues here:
 * The consistency is not perfect. Each transaction is started
  with small delays in steps 1-2, but we cannot guarantee that no
  other transaction happens between them.

This is exactly where the patch for synchronized snapshot comes into
the game. See https://commitfest.postgresql.org/action/patch_view?id=480
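
With such a patch in place the dance would be roughly the following sketch
(pg_export_snapshot() and SET TRANSACTION SNAPSHOT are only placeholder
names for whatever interface that patch ends up providing; error checking
omitted):

#include <stdio.h>
#include "libpq-fe.h"

/* have a worker connection adopt the snapshot of the master connection */
static void
sync_worker_snapshot(PGconn *master, PGconn *worker)
{
    PGresult   *res;
    char        command[128];

    /* master: open its transaction and export the snapshot identifier */
    PQclear(PQexec(master, "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE"));
    res = PQexec(master, "SELECT pg_export_snapshot()");

    /* worker: open its own transaction and adopt that snapshot */
    PQclear(PQexec(worker, "BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE"));
    snprintf(command, sizeof(command),
             "SET TRANSACTION SNAPSHOT '%s'", PQgetvalue(res, 0, 0));
    PQclear(PQexec(worker, command));
    PQclear(res);
}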

 * Can we inherit connections to child processes with fork()?
  Moreover, we also need to pass running transactions to children.
  I wonder whether libpq is designed for such usage.

As far as I know you can inherit sockets to a child process, as long
as you make sure that after the fork only one of the two, parent or
child, uses the socket; the other one should close it. But this
wouldn't be an issue with the above-mentioned patch anyway.

It might be better to remove the Windows-specific code from the first try.
I doubt the Windows message queue is the best API for such a console-based
application. I hope we can use the same implementation on all
platforms for inter-process/thread communication.

Windows doesn't support pipes, but offers message queues to
exchange messages. Parallel pg_dump only exchanges messages in the
form of "DUMP 39209" or "RESTORE OK 48 23 93"; it doesn't exchange any
large chunks of binary data, just these small textual messages. The
messages also stay within the same process; they are just sent between
the different threads. The Windows part worked just fine when I tested
it last time. Do you have any other technology in mind that you think
is better suited?
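
Just to make the scale concrete, building and parsing such a message is no
more than this (an illustrative sketch, not the actual code; the meaning of
the three numbers is only an example):

#include <stdio.h>

int
main(void)
{
    char    msg[64];
    int     dumpId,
            n_errors,
            n_warnings;

    /* master -> worker: please dump the object with this dump ID */
    snprintf(msg, sizeof(msg), "DUMP %d", 39209);

    /* worker -> master: report a finished item */
    snprintf(msg, sizeof(msg), "RESTORE OK %d %d %d", 48, 23, 93);

    /* master side: parse the status message back out */
    if (sscanf(msg, "RESTORE OK %d %d %d", &dumpId, &n_errors, &n_warnings) == 3)
        printf("item %d done (%d, %d)\n", dumpId, n_errors, n_warnings);

    return 0;
}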

Joachim

#20Magnus Hagander
magnus@hagander.net
In reply to: Joachim Wieland (#19)
Re: pg_dump directory archive format / parallel pg_dump

On Sat, Feb 5, 2011 at 04:50, Joachim Wieland <joe@mcknight.de> wrote:

On Thu, Feb 3, 2011 at 11:46 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

It might be better to remove the Windows-specific code from the first try.
I doubt the Windows message queue is the best API for such a console-based
application. I hope we can use the same implementation on all
platforms for inter-process/thread communication.

Windows doesn't support pipes, but offers message queues to
exchange messages. Parallel pg_dump only exchanges messages in the
form of "DUMP 39209" or "RESTORE OK 48 23 93"; it doesn't exchange any
large chunks of binary data, just these small textual messages. The
messages also stay within the same process; they are just sent between
the different threads. The Windows part worked just fine when I tested
it last time. Do you have any other technology in mind that you think
is better suited?

Haven't been following this thread in details or read the code.. But
our /port directory contains a pipe() implementation for Windows,
that's used for the syslogger at least. Look in the code for pgpipe().
If using that one works, then that should probably be used rather than
something completely custom.
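
From memory, port.h wraps that behind pgpipe()/piperead()/pipewrite() so
callers can stay platform-neutral, roughly as in the sketch below (untested;
in particular I haven't checked whether a frontend program like pg_dump can
link the Windows implementation as it stands, so take this as an assumption
and look at the syslogger code for the real usage):

#include "postgres_fe.h"    /* should bring in the port.h declarations */

static int
pipe_roundtrip(void)
{
    int     fds[2];
    char    buf[64];
    int     n;

    if (pgpipe(fds) < 0)    /* plain pipe() on Unix, socket emulation on Windows */
        return -1;

    pipewrite(fds[1], "DUMP 39209", 10);
    n = piperead(fds[0], buf, sizeof(buf) - 1);
    if (n > 0)
        buf[n] = '\0';

    /* closing is close() vs. closesocket() depending on the platform, omitted */
    return n;
}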

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#21Jaime Casanova
jaime@2ndquadrant.com
In reply to: Joachim Wieland (#17)
Re: pg_dump directory archive format / parallel pg_dump

On Tue, Feb 1, 2011 at 11:32 PM, Joachim Wieland <joe@mcknight.de> wrote:

On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The parallel pg_dump portion of this patch (i.e. the still-uncommitted
part) no longer applies.  Please rebase.

Here is a rebased version with some minor changes as well. I haven't
tested it on Windows now but will do so as soon as the Unix part has
been reviewed.

code review:

Something I found, and it's a very simple one, is this warning (there's
a similar issue in _StartMasterParallel with the buf variable):
"""
pg_backup_directory.c: In function ‘_EndMasterParallel’:
pg_backup_directory.c:856: warning: ‘status’ may be used uninitialized
in this function
"""

I guess the huge amount of info the patch is showing is just for
debugging and will be removed before commit, right?

functional review:

It works well most of the time, just a few points:
- if I interrupt the process the connections stay; I guess it could
catch the signal and close the connections
- if I have an exclusive lock on a table and a worker starts dumping
it, it fails because it can't take the lock, but it just says "it was
ok"; I would prefer an error

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training

#22Jaime Casanova
jaime@2ndquadrant.com
In reply to: Jaime Casanova (#21)
Re: pg_dump directory archive format / parallel pg_dump

On Sun, Feb 6, 2011 at 2:12 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Tue, Feb 1, 2011 at 11:32 PM, Joachim Wieland <joe@mcknight.de> wrote:

On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The parallel pg_dump portion of this patch (i.e. the still-uncommitted
part) no longer applies.  Please rebase.

Here is a rebased version with some minor changes as well. I haven't
tested it on Windows now but will do so as soon as the Unix part has
been reviewed.

code review:

Ah! Two other things I forgot:

- there are no docs
- pg_dump and pg_restore are inconsistent:
pg_dump requires the directory to be provided with the -f option:
pg_dump -Fd -f dir_dump
pg_restore takes the directory as an argument after -Fd: pg_restore -Fd dir_dump

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training

#23Joachim Wieland
joe@mcknight.de
In reply to: Jaime Casanova (#21)
Re: pg_dump directory archive format / parallel pg_dump

Hi Jaime,

thanks for your review!

On Sun, Feb 6, 2011 at 2:12 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

code review:

Something I found, and it's a very simple one, is this warning (there's
a similar issue in _StartMasterParallel with the buf variable):
"""
pg_backup_directory.c: In function ‘_EndMasterParallel’:
pg_backup_directory.c:856: warning: ‘status’ may be used uninitialized
in this function
"""

Cool. My compiler didn't tell me about this.

I guess the huge amount of info the patch is showing is just for
debugging and will be removed before commit, right?

That's right.

functional review:

It works well most of the time, just a few points:
- if I interrupt the process the connections stay; I guess it could
catch the signal and close the connections

Hm, well, recovering gracefully from errors could be improved. In
your example you would signal the children implicitly because the
parent process dies and the pipes to the children would get broken as
well. Of course the parent could more actively terminate the children
but it might not be the best option to just kill them, as then there
will be a lot of "unexpected EOF" connections in the log. So if an
error condition comes up in the parent (as in your example, because
you canceled the process), then ideally the parent should signal the
children with a non-lethal signal and the children should catch this
"please terminate" signal and exit cleanly but as soon as possible. If
the error case comes up at the child however, then we'd need to make
sure that the user sees the error message from the child. This should
work well as-is, but currently it could happen that the parent exits
before all of the children have exited. I'll investigate this a bit...
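
The usual pattern would be a flag that is set from the signal handler and
that the worker checks between steps, roughly like this (a sketch, not the
patch's code; the choice of SIGTERM is just an example):

#include <signal.h>
#include <stdlib.h>

static volatile sig_atomic_t wantAbort = 0;

/* the parent sends e.g. SIGTERM to ask a worker to wrap up */
static void
abort_handler(int signo)
{
    (void) signo;
    wantAbort = 1;
}

static void
worker_loop(void)
{
    signal(SIGTERM, abort_handler);

    for (;;)
    {
        if (wantAbort)
        {
            /* close the connection cleanly (PQfinish) and exit right away */
            exit(1);
        }
        /* ... dump the next object ... */
    }
}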

- if I have an exclusive lock on a table and a worker starts dumping
it, it fails because it can't take the lock, but it just says "it was
ok"; I would prefer an error

I'm getting a clear

pg_dump: [archiver] could not lock table public.c: ERROR: could
not obtain lock on relation "c"

but I'll look into this as well.
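
For reference, the check on the worker side boils down to something like the
sketch below (simplified; identifier quoting and error handling look
different in the real code). The point is just that a failed NOWAIT lock has
to be reported as an error rather than swallowed:

#include <stdio.h>
#include "libpq-fe.h"

/* try to lock one table without waiting and complain loudly on failure */
static int
lock_table_nowait(PGconn *conn, const char *qualified_name)
{
    char        query[256];
    PGresult   *res;
    int         ok;

    snprintf(query, sizeof(query),
             "LOCK TABLE %s IN ACCESS SHARE MODE NOWAIT", qualified_name);
    res = PQexec(conn, query);
    ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    if (!ok)
        fprintf(stderr, "could not lock table %s: %s",
                qualified_name, PQerrorMessage(conn));
    PQclear(res);
    return ok;
}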

Regarding your other post:

- there are no docs

True...

- pg_dump and pg_restore are inconsistent:
pg_dump requires the directory to be provided with the -f option:
pg_dump -Fd -f dir_dump
pg_restore takes the directory as an argument after -Fd: pg_restore -Fd dir_dump

Well, this is how pg_dump and pg_restore currently behave as well. -F
is the switch for the format and it just takes "d" as the format. The
dir_dump is an argument without any switch.

See the output for the --help switches:

Usage:
pg_dump [OPTION]... [DBNAME]

Usage:
pg_restore [OPTION]... [FILE]

So in either case you don't need to give a switch for what you have.
If you run pg_dump you don't give the switch for the database but you
need to give it for the output (-f) and with pg_restore you don't give
a switch for the file that you're restoring but you'd need to give -d
for restoring to a database.

Joachim

#24Robert Haas
robertmhaas@gmail.com
In reply to: Joachim Wieland (#23)
Re: pg_dump directory archive format / parallel pg_dump

On Mon, Feb 7, 2011 at 10:42 PM, Joachim Wieland <joe@mcknight.de> wrote:

i guess the huge amount of info is showing the patch is just for
debugging and will be removed before commit, right?

That's right.

So how close are we to having a committable version of this? Should
we push this out to 9.2?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#25Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Robert Haas (#24)
Re: pg_dump directory archive format / parallel pg_dump

On Tue, Feb 8, 2011 at 13:34, Robert Haas <robertmhaas@gmail.com> wrote:

So how close are we to having a committable version of this?  Should
we push this out to 9.2?

I think so. The feature is pretty attractive, but more work is required:
 * Re-base on the synchronized snapshots patch
 * Consider using pipes on Windows as well.
 * Research the libpq + fork() issue. We have a warning in the docs:
http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html
| On Unix, forking a process with open libpq connections can lead to
unpredictable results

--
Itagaki Takahiro

#26Joachim Wieland
joe@mcknight.de
In reply to: Itagaki Takahiro (#25)
Re: pg_dump directory archive format / parallel pg_dump

On Tue, Feb 8, 2011 at 8:31 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

On Tue, Feb 8, 2011 at 13:34, Robert Haas <robertmhaas@gmail.com> wrote:

So how close are we to having a committable version of this?  Should
we push this out to 9.2?

I think so. The feature is pretty attractive, but more work is required:
 * Re-base on the synchronized snapshots patch
 * Consider using pipes on Windows as well.
 * Research the libpq + fork() issue. We have a warning in the docs:
http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html
| On Unix, forking a process with open libpq connections can lead to
unpredictable results

Just for the record, once the sync snapshot patch is committed, there
is no need to do fancy libpq + fork() combinations anyway.
Unfortunately, so far no committer has commented on the synchronized
snapshot patch at all.

I am not fighting for getting parallel pg_dump done in 9.1, as I don't
really have a personal use case for the patch. However it would be the
irony of the year if we shipped 9.1 with a synchronized snapshot patch
but no parallel dump :-)

Joachim

#27Robert Haas
robertmhaas@gmail.com
In reply to: Joachim Wieland (#26)
Re: pg_dump directory archive format / parallel pg_dump

On Tue, Feb 8, 2011 at 10:54 PM, Joachim Wieland <joe@mcknight.de> wrote:

On Tue, Feb 8, 2011 at 8:31 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

On Tue, Feb 8, 2011 at 13:34, Robert Haas <robertmhaas@gmail.com> wrote:

So how close are we to having a committable version of this?  Should
we push this out to 9.2?

I think so. The feature is pretty attractive, but more work is required:
 * Re-base on the synchronized snapshots patch
 * Consider using pipes on Windows as well.
 * Research the libpq + fork() issue. We have a warning in the docs:
http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html
| On Unix, forking a process with open libpq connections can lead to
unpredictable results

Just for the record, once the sync snapshot patch is committed, there
is no need to do fancy libpq + fork() combinations anyway.
Unfortunately, so far no committer has commented on the synchronized
snapshot patch at all.

I am not fighting for getting parallel pg_dump done in 9.1, as I don't
really have a personal use case for the patch. However it would be the
irony of the year if we shipped 9.1 with a synchronized snapshot patch
but no parallel dump  :-)

True. But it looks like there are some outstanding items from
previous reviews that you've yet to address, which makes pushing it
out seem fairly reasonable...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company