Add LZ4 compression in pg_dump

Started by Georgios Kokolatosabout 4 years ago32 messageshackers
Jump to latest
#1Georgios Kokolatos
gkokolatos@protonmail.com

Hi,

please find attached a patchset which adds lz4 compression in pg_dump.

The first commit does the heavy lifting required for additional compression methods.
It expands testing coverage for the already supported gzip compression. Commit
bf9aa490db introduced cfp in compress_io.{c,h} with the intent of unifying
compression related code and allow for the introduction of additional archive
formats. However, pg_backup_archiver.c was not using that API. This commit
teaches pg_backup_archiver.c about cfp and is using it through out.

Furthermore, compression was chosen based on the value of the level passed
as an argument during the invocation of pg_dump or some hardcoded defaults. This
does not scale for more than one compression methods. Now the method used for
compression can be explicitly requested during command invocation, or set during
hardcoded defaults. Then it is stored in the relevant structs and passed in the
relevant functions, along side compression level which has lost it's special
meaning. The method for compression is not yet stored in the actual archive.
This is done in the next commit which does introduce a new method.

The previously named CompressionAlgorithm enum is changed for
CompressionMethod so that it matches better similar variables found through out
the code base.

In a fashion similar to the binary for pg_basebackup, the method for compression
is passed using the already existing -Z/--compress parameter of pg_dump. The
legacy format and behaviour is maintained. Additionally, the user can explicitly
pass a requested method and optionaly the level to be used after a semicolon,e.g. --compress=gzip:6

The second commit adds LZ4 compression in pg_dump and pg_restore.

Within compress_io.{c,h} there are two distinct APIs exposed, the streaming API
and a file API. The first one, is aimed at inlined use cases and thus simple
lz4.h calls can be used directly. The second one is generating output, or is
parsing input, which can be read/generated via the lz4 utility.

In the later case, the API is using an opaque wrapper around a file stream,
which aquired via fopen() or gzopen() respectively. It would then provide
wrappers around fread(), fwrite(), fgets(), fgetc(), feof(), and fclose(); or
their gz equivallents. However the LZ4F api does not provide this functionality.
So this has been implemented localy.

In order to maintain the API compatibility a new structure LZ4File is
introduced. It is responsible for keeping state and any yet unused generated
content. The later is required when the generated decompressed output, exceeds
the caller's buffer capacity.

Custom compressed archives need to now store the compression method in their
header. This requires a bump in the version number. The level of compression is
still stored in the dump, though admittedly is of no apparent use.

The series is authored by me. Rachel Heaton helped out with the expansion
of the testing coverage, testing in different platforms and providing debug information
on those, as well as native speaker wording.

Cheers,
//Georgios

Attachments:

v1-0001-Prepare-pg_dump-for-additional-compression-method.patchtext/x-patch; name=v1-0001-Prepare-pg_dump-for-additional-compression-method.patchDownload+659-355
v1-0002-Add-LZ4-compression-in-pg_-dump-restore.patchtext/x-patch; name=v1-0002-Add-LZ4-compression-in-pg_-dump-restore.patchDownload+868-127
#2Michael Paquier
michael@paquier.xyz
In reply to: Georgios Kokolatos (#1)
Re: Add LZ4 compression in pg_dump

On Fri, Feb 25, 2022 at 12:05:31PM +0000, Georgios wrote:

The first commit does the heavy lifting required for additional compression methods.
It expands testing coverage for the already supported gzip compression. Commit
bf9aa490db introduced cfp in compress_io.{c,h} with the intent of unifying
compression related code and allow for the introduction of additional archive
formats. However, pg_backup_archiver.c was not using that API. This commit
teaches pg_backup_archiver.c about cfp and is using it through out.

Thanks for the patch. I have a few high-level comments.

+   # Do not use --no-sync to give test coverage for data sync.
+   compression_gzip_directory_format => {
+       test_key => 'compression',

The tests for GZIP had better be split into their own commit, as
that's a coverage improvement for the existing code.

I was assuming that this was going to be much larger :)

+/* Routines that support LZ4 compressed data I/O */
+#ifdef HAVE_LIBLZ4
+static void InitCompressorLZ4(CompressorState *cs);
+static void ReadDataFromArchiveLZ4(ArchiveHandle *AH, ReadFunc
readF);
+static void WriteDataToArchiveLZ4(ArchiveHandle *AH, CompressorState *cs,
+                                 const char *data, size_t dLen);
+static void EndCompressorLZ4(ArchiveHandle *AH, CompressorState *cs);
+#endif

Hmm. This is the same set of APIs as ZLIB and NONE to init, read,
write and end, but for the LZ4 compressor (NONE has no init/end).
Wouldn't it be better to refactor the existing pg_dump code to have a
central structure holding all the function definitions in a common
structure so as all those function signatures are set in stone in the
shape of a catalog of callbacks, making the addition of more
compression formats easier? I would imagine that we'd split the code
of each compression method into their own file with their own context
data. This would lead to a removal of compress_io.c, with its entry
points ReadDataFromArchive(), WriteDataToArchive() & co replaced by
pointers to each per-compression callback.

Furthermore, compression was chosen based on the value of the level passed
as an argument during the invocation of pg_dump or some hardcoded defaults. This
does not scale for more than one compression methods. Now the method used for
compression can be explicitly requested during command invocation, or set during
hardcoded defaults. Then it is stored in the relevant structs and passed in the
relevant functions, along side compression level which has lost it's special
meaning. The method for compression is not yet stored in the actual archive.
This is done in the next commit which does introduce a new method.

That's one thing Robert was arguing about with pg_basebackup, so that
would be consistent, and the option set is backward-compatible as far
as I get it by reading the code.
--
Michael

#3Justin Pryzby
pryzby@telsasoft.com
In reply to: Georgios Kokolatos (#1)
Re: Add LZ4 compression in pg_dump

The patch is failing on cfbot/freebsd.
http://cfbot.cputube.org/georgios-kokolatos.html

Also, I wondered if you'd looked at the "high compression" interfaces in
lz4hc.h ? Should pg_dump use that ?

Show quoted text

On Fri, Feb 25, 2022 at 08:03:40AM -0600, Justin Pryzby wrote:

Thanks for working on this. Your 0001 looks similar to what I did for zstd 1-2
years ago.
https://commitfest.postgresql.org/32/2888/

I rebased and attached the latest patches I had in case they're useful to you.
I'd like to see zstd included in pg_dump eventually, but it was too much work
to shepherd the patches. Now that seems reasonable for pg16.

With the other compression patches I've worked on, we've used an extra patch
with changes the default to the new compression algorithm, to force cfbot to
exercize the new code.

Do you know the process with commitfests and cfbot ?
There's also this, which allows running the tests on cirrus before mailing the
patch to the hackers list.
./src/tools/ci/README

#4Bruce Momjian
bruce@momjian.us
In reply to: Justin Pryzby (#3)
Re: Add LZ4 compression in pg_dump

It seems development on this has stalled. If there's no further work
happening I guess I'll mark the patch returned with feedback. Feel
free to resubmit it to the next CF when there's progress.

#5Justin Pryzby
pryzby@telsasoft.com
In reply to: Bruce Momjian (#4)
Re: Add LZ4 compression in pg_dump

On Fri, Mar 25, 2022 at 01:20:47AM -0400, Greg Stark wrote:

It seems development on this has stalled. If there's no further work
happening I guess I'll mark the patch returned with feedback. Feel
free to resubmit it to the next CF when there's progress.

Since it's a reasonably large patch (and one that I had myself started before)
and it's only been 20some days since (minor) review comments, and since the
focus right now is on committing features, and not reviewing new patches, and
this patch is new one month ago, and its 0002 not intended for pg15, therefor
I'm moving it to the next CF, where I hope to work with its authors to progress
it.

--
Justin

#6Rachel Heaton
rachelmheaton@gmail.com
In reply to: Justin Pryzby (#5)
Re: Add LZ4 compression in pg_dump

On Fri, Mar 25, 2022 at 6:22 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Fri, Mar 25, 2022 at 01:20:47AM -0400, Greg Stark wrote:

It seems development on this has stalled. If there's no further work
happening I guess I'll mark the patch returned with feedback. Feel
free to resubmit it to the next CF when there's progress.

Since it's a reasonably large patch (and one that I had myself started before)
and it's only been 20some days since (minor) review comments, and since the
focus right now is on committing features, and not reviewing new patches, and
this patch is new one month ago, and its 0002 not intended for pg15, therefor
I'm moving it to the next CF, where I hope to work with its authors to progress
it.

Hi Folks,

Here is an updated patchset from Georgios, with minor assistance from myself.
The comments above should be addressed, but please let us know if
there are other things to go over. A functional change in this
patchset is when `--compress=none` is passed to pg_dump, it will not
compress for directory type (previously, it would use gzip if
present). The previous default behavior is retained.

- Rachel

Attachments:

v2-0001-Extend-compression-coverage-for-pg_dump-pg_restor.patchapplication/octet-stream; name=v2-0001-Extend-compression-coverage-for-pg_dump-pg_restor.patchDownload+107-1
v2-0002-Remove-unsupported-bitrot-compression-from-pg_dum.patchapplication/octet-stream; name=v2-0002-Remove-unsupported-bitrot-compression-from-pg_dum.patchDownload+9-74
v2-0003-Prepare-pg_dump-for-additional-compression-method.patchapplication/octet-stream; name=v2-0003-Prepare-pg_dump-for-additional-compression-method.patchDownload+616-361
v2-0004-Add-LZ4-compression-in-pg_-dump-restore.patchapplication/octet-stream; name=v2-0004-Add-LZ4-compression-in-pg_-dump-restore.patchDownload+905-148
#7Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: Rachel Heaton (#6)
Re: Add LZ4 compression in pg_dump

------- Original Message -------

On Saturday, March 26th, 2022 at 12:13 AM, Rachel Heaton <rachelmheaton@gmail.com> wrote:

On Fri, Mar 25, 2022 at 6:22 AM Justin Pryzby pryzby@telsasoft.com wrote:

On Fri, Mar 25, 2022 at 01:20:47AM -0400, Greg Stark wrote:

It seems development on this has stalled. If there's no further work
happening I guess I'll mark the patch returned with feedback. Feel
free to resubmit it to the next CF when there's progress.

We had some progress yet we didn't want to distract the list with too many
emails. Of course, it seemed stalled to the outside observer, yet I simply
wanted to set the record straight and say that we are actively working on it.

Since it's a reasonably large patch (and one that I had myself started before)
and it's only been 20some days since (minor) review comments, and since the
focus right now is on committing features, and not reviewing new patches, and
this patch is new one month ago, and its 0002 not intended for pg15, therefor
I'm moving it to the next CF, where I hope to work with its authors to progress
it.

Thank you. It is much appreciated. We will sent updates when the next commitfest
starts in July as to not distract from the 15 work. Then, we can take it from there.

Hi Folks,

Here is an updated patchset from Georgios, with minor assistance from myself.
The comments above should be addressed, but please let us know if

A small amendment to the above statement. This patchset does not include the
refactoring of compress_io suggested by Mr Paquier in the same thread, as it is
missing documentation. An updated version will be sent to include those changes
on the next commitfest.

Show quoted text

there are other things to go over. A functional change in this
patchset is when `--compress=none` is passed to pg_dump, it will not
compress for directory type (previously, it would use gzip if
present). The previous default behavior is retained.

- Rachel

#8Michael Paquier
michael@paquier.xyz
In reply to: Georgios Kokolatos (#7)
Re: Add LZ4 compression in pg_dump

On Fri, Mar 25, 2022 at 11:43:17PM +0000, gkokolatos@pm.me wrote:

On Saturday, March 26th, 2022 at 12:13 AM, Rachel Heaton <rachelmheaton@gmail.com> wrote:

Here is an updated patchset from Georgios, with minor assistance from myself.
The comments above should be addressed, but please let us know if

A small amendment to the above statement. This patchset does not include the
refactoring of compress_io suggested by Mr Paquier in the same thread, as it is
missing documentation. An updated version will be sent to include those changes
on the next commitfest.

The refactoring using callbacks would make the code much cleaner IMO
in the long term, with zstd waiting in the queue. Now, I see some
pieces of the patch set that could be merged now without waiting for
the development cycle of 16 to begin, as of 0001 to add more tests and
0002.

I have a question about 0002, actually. What has led you to the
conclusion that this code is dead and could be removed?
--
Michael

#9Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#8)
Re: Add LZ4 compression in pg_dump

On Sat, Mar 26, 2022 at 02:57:50PM +0900, Michael Paquier wrote:

I have a question about 0002, actually. What has led you to the
conclusion that this code is dead and could be removed?

See 0001 and the manpage.

+ 'pg_dump: compression is not supported by tar archive format');

When I submitted a patch to support zstd, I spent awhile trying to make
compression work with tar, but it's a significant effort and better done
separately.

#10Justin Pryzby
pryzby@telsasoft.com
In reply to: Georgios Kokolatos (#1)
Re: Add LZ4 compression in pg_dump

LZ4F_HEADER_SIZE_MAX isn't defined in old LZ4.

I ran into that on an ubuntu LTS, so I don't think it's so old that it
shouldn't be handled more gracefully. LZ4 should either have an explicit
version check, or else shouldn't depend on that feature (or should define a
safe fallback version if the library header doesn't define it).

https://packages.ubuntu.com/liblz4-1

0003: typo: of legacy => or legacy

There are a large number of ifdefs being added here - it'd be nice to minimize
that. basebackup was organized to use separate files, which is one way.

$ git grep -c 'ifdef .*LZ4' src/bin/pg_dump/compress_io.c
src/bin/pg_dump/compress_io.c:19

In last year's CF entry, I had made a union within CompressorState. LZ4
doesn't need z_streamp (and ztsd will need ZSTD_outBuffer, ZSTD_inBuffer,
ZSTD_CStream).

0002: I wonder if you're able to re-use any of the basebackup parsing stuff
from commit ffd53659c. You're passing both the compression method *and* level.
I think there should be a structure which includes both. In the future, that
can also handle additional options. I hope to re-use these same things for
wal_compression=method:level.

You renamed this:

|- COMPR_ALG_LIBZ
|-} CompressionAlgorithm;
|+ COMPRESSION_GZIP,
|+} CompressionMethod;

..But I don't think that's an improvement. If you were to change it, it should
say something like PGDUMP_COMPRESS_ZLIB, since there are other compression
structs and typedefs. zlib is not idential to gzip, which uses a different
header, so in WriteDataToArchive(), LIBZ is correct, and GZIP is incorrect.

The cf* changes in pg_backup_archiver could be split out into a separate
commit. It's strictly a code simplification - not just preparation for more
compression algorithms. The commit message should "See also:
bf9aa490db24b2334b3595ee33653bf2fe39208c".

The changes in 0002 for cfopen_write seem insufficient:
|+ if (compressionMethod == COMPRESSION_NONE)
|+ fp = cfopen(path, mode, compressionMethod, 0);
| else
| {
| #ifdef HAVE_LIBZ
| char *fname;
|
| fname = psprintf("%s.gz", path);
|- fp = cfopen(fname, mode, compression);
|+ fp = cfopen(fname, mode, compressionMethod, compressionLevel);
| free_keep_errno(fname);
| #else

The only difference between the LIBZ and uncompressed case is the file
extension, and it'll be the only difference with LZ4 too. So I suggest to
first handle the file extension, and the rest of the code path is not
conditional on the compression method. I don't think cfopen_write even needs
HAVE_LIBZ - can't you handle that in cfopen_internal() ?

This patch rejects -Z0, which ought to be accepted:
./src/bin/pg_dump/pg_dump -h /tmp regression -Fc -Z0 |wc
pg_dump: error: can only specify -Z/--compress [LEVEL] when method is set

Your 0003 patch shouldn't reference LZ4:
+#ifndef HAVE_LIBLZ4
+       if (*compressionMethod == COMPRESSION_LZ4)
+               supports_compression = false;
+#endif

The 0004 patch renames zlibOutSize to outsize - I think the patch series should
be constructed such as to minimize the size of the method-specific patches. I
say this anticipating also adding support for zstd. The preliminary patches
should have all the boring stuff. It would help for reviewing to keep the
patches split up, or to enumerate all the boring things that are being renamed
(like change OutputContext to cfp, rename zlibOutSize, ...).

0004: The include should use <lz4.h> and not "lz4.h"

freebsd/cfbot is failing.

I suggested off-list to add an 0099 patch to change LZ4 to the default, to
exercise it more on CI.

--
Justin

#11Justin Pryzby
pryzby@telsasoft.com
In reply to: Justin Pryzby (#10)
Re: Add LZ4 compression in pg_dump

On Sat, Mar 26, 2022 at 11:21:56AM -0500, Justin Pryzby wrote:

You're passing both the compression method *and* level. I think there should
be a structure which includes both. In the future, that can also handle
additional options.

I'm not sure if there's anything worth saving, but I did that last year with
0003-Support-multiple-compression-algs-levels-opts.patch
I sent a rebased copy off-list.
/messages/by-id/20210104025321.GA9712@telsasoft.com

| fatal("not built with LZ4 support");
| fatal("not built with lz4 support");

Please use consistent capitalization of "lz4" - then the compiler can optimize
away duplicate strings.

0004: The include should use <lz4.h> and not "lz4.h"

Also, use USE_LZ4 rather than HAVE_LIBLZ4, per 75eae0908.

#12Daniel Gustafsson
daniel@yesql.se
In reply to: Justin Pryzby (#10)
Re: Add LZ4 compression in pg_dump

On 26 Mar 2022, at 17:21, Justin Pryzby <pryzby@telsasoft.com> wrote:

I suggested off-list to add an 0099 patch to change LZ4 to the default, to
exercise it more on CI.

No need to change the defaults in autoconf for that. The CFBot uses the cirrus
file in the tree so changing what the job includes can be easily done (assuming
the CFBot hasn't changed this recently which I think it hasn't). I used that
trick in the NSS patchset to add a completely new job for --with-ssl=nss beside
the --with-ssl=openssl job.

--
Daniel Gustafsson https://vmware.com/

#13Justin Pryzby
pryzby@telsasoft.com
In reply to: Daniel Gustafsson (#12)
Re: Add LZ4 compression in pg_dump

On Sun, Mar 27, 2022 at 12:37:27AM +0100, Daniel Gustafsson wrote:

On 26 Mar 2022, at 17:21, Justin Pryzby <pryzby@telsasoft.com> wrote:

I suggested off-list to add an 0099 patch to change LZ4 to the default, to
exercise it more on CI.

No need to change the defaults in autoconf for that. The CFBot uses the cirrus
file in the tree so changing what the job includes can be easily done (assuming
the CFBot hasn't changed this recently which I think it hasn't). I used that
trick in the NSS patchset to add a completely new job for --with-ssl=nss beside
the --with-ssl=openssl job.

I think you misunderstood - I'm suggesting not only to use with-lz4 (which was
always true since 93d973494), but to change pg_dump -Fc and -Fd to use LZ4 by
default (the same as I suggested for toast_compression, wal_compression, and
again in last year's patch to add zstd compression to pg_dump, for which
postgres was not ready).

@@ -781,6 +807,11 @@ main(int argc, char **argv)
                        compress.alg = COMPR_ALG_LIBZ;
                        compress.level = Z_DEFAULT_COMPRESSION;
 #endif
+
+#ifdef USE_ZSTD
+                       compress.alg = COMPR_ALG_ZSTD; // Set default for testing purposes
+                       compress.level = ZSTD_CLEVEL_DEFAULT;
+#endif
#14Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#10)
Re: Add LZ4 compression in pg_dump

On Sat, Mar 26, 2022 at 12:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

0002: I wonder if you're able to re-use any of the basebackup parsing stuff
from commit ffd53659c. You're passing both the compression method *and* level.
I think there should be a structure which includes both. In the future, that
can also handle additional options. I hope to re-use these same things for
wal_compression=method:level.

Yeah, we should really try to use that infrastructure instead of
inventing a bunch of different ways to do it. It might require some
renaming here and there, and I'm not sure whether we really want to
try to rush all this into the current release, but I think we should
find a way to get it done.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15Justin Pryzby
pryzby@telsasoft.com
In reply to: Robert Haas (#14)
Re: Add LZ4 compression in pg_dump

On Sun, Mar 27, 2022 at 10:13:00AM -0400, Robert Haas wrote:

On Sat, Mar 26, 2022 at 12:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

0002: I wonder if you're able to re-use any of the basebackup parsing stuff
from commit ffd53659c. You're passing both the compression method *and* level.
I think there should be a structure which includes both. In the future, that
can also handle additional options. I hope to re-use these same things for
wal_compression=method:level.

Yeah, we should really try to use that infrastructure instead of
inventing a bunch of different ways to do it. It might require some
renaming here and there, and I'm not sure whether we really want to
try to rush all this into the current release, but I think we should
find a way to get it done.

It seems like something a whole lot like parse_compress_options() should be in
common/. Nobody wants to write it again, and I couldn't convince myself to
copy it when I looked at using it for wal_compression.

Maybe it should take an argument which specifies the default algorithm to use
for input of a numeric "level". And reject such input if not specified, since
wal_compression has never taken a "level", so it's not useful or desirable to
have that default to some new algorithm.

I could write this down if you want, although I'm not sure how/if you intend
other people to use bc_algorithm and bc_algorithm. I don't think it's
important to do for v15, but it seems like it could be done after featue
freeze. pg_dump+lz4 is targetting v16, although there's a cleanup patch that
could also go in before branching.

--
Justin

#16Daniel Gustafsson
daniel@yesql.se
In reply to: Justin Pryzby (#13)
Re: Add LZ4 compression in pg_dump

On 27 Mar 2022, at 00:51, Justin Pryzby <pryzby@telsasoft.com> wrote:

On Sun, Mar 27, 2022 at 12:37:27AM +0100, Daniel Gustafsson wrote:

On 26 Mar 2022, at 17:21, Justin Pryzby <pryzby@telsasoft.com> wrote:

I suggested off-list to add an 0099 patch to change LZ4 to the default, to
exercise it more on CI.

No need to change the defaults in autoconf for that. The CFBot uses the cirrus
file in the tree so changing what the job includes can be easily done (assuming
the CFBot hasn't changed this recently which I think it hasn't). I used that
trick in the NSS patchset to add a completely new job for --with-ssl=nss beside
the --with-ssl=openssl job.

I think you misunderstood - I'm suggesting not only to use with-lz4 (which was
always true since 93d973494), but to change pg_dump -Fc and -Fd to use LZ4 by
default (the same as I suggested for toast_compression, wal_compression, and
again in last year's patch to add zstd compression to pg_dump, for which
postgres was not ready).

Right, I clearly misunderstood, thanks for the clarification.

--
Daniel Gustafsson https://vmware.com/

#17Robert Haas
robertmhaas@gmail.com
In reply to: Justin Pryzby (#15)
Re: Add LZ4 compression in pg_dump

On Sun, Mar 27, 2022 at 12:06 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

Maybe it should take an argument which specifies the default algorithm to use
for input of a numeric "level". And reject such input if not specified, since
wal_compression has never taken a "level", so it's not useful or desirable to
have that default to some new algorithm.

That sounds odd to me. Wouldn't it be rather confusing if a bare
integer meant gzip for one case and lz4 for another?

I could write this down if you want, although I'm not sure how/if you intend
other people to use bc_algorithm and bc_algorithm. I don't think it's
important to do for v15, but it seems like it could be done after featue
freeze. pg_dump+lz4 is targetting v16, although there's a cleanup patch that
could also go in before branching.

Well, I think the first thing we should do is get rid of enum
WalCompressionMethod and use enum WalCompression instead. They've got
the same elements and very similar names, but the WalCompressionMethod
ones just have names like COMPRESSION_NONE, which is too generic,
whereas WalCompressionMethod uses WAL_COMPRESSION_NONE, which is
better. Then I think we should also rename the COMPR_ALG_* constants
in pg_dump.h to names like DUMP_COMPRESSION_*. Once we do that we've
got rid of all the unprefixed things that purport to be a list of
compression algorithms.

Then, if people are willing to adopt the syntax that the
backup_compression.c/h stuff supports as a project standard (+1 from
me) we can go the other way and rename that stuff to be more generic,
taking backup out of the name.

--
Robert Haas
EDB: http://www.enterprisedb.com

#18Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#17)
Re: Add LZ4 compression in pg_dump

On Mon, Mar 28, 2022 at 08:36:15AM -0400, Robert Haas wrote:

Well, I think the first thing we should do is get rid of enum
WalCompressionMethod and use enum WalCompression instead. They've got
the same elements and very similar names, but the WalCompressionMethod
ones just have names like COMPRESSION_NONE, which is too generic,
whereas WalCompressionMethod uses WAL_COMPRESSION_NONE, which is
better. Then I think we should also rename the COMPR_ALG_* constants
in pg_dump.h to names like DUMP_COMPRESSION_*. Once we do that we've
got rid of all the unprefixed things that purport to be a list of
compression algorithms.

Yes, having a centralized enum for the compression method would make
sense, along with the routines to parse and get the compression method
names. At least that would be one step towards more unity in
src/common/.

Then, if people are willing to adopt the syntax that the
backup_compression.c/h stuff supports as a project standard (+1 from
me) we can go the other way and rename that stuff to be more generic,
taking backup out of the name.

I am not sure about the specification part which is only used by base
backups that has no client-server requirements, so option values would
still require their own grammar.
--
Michael

#19Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#9)
Re: Add LZ4 compression in pg_dump

On Sat, Mar 26, 2022 at 01:14:41AM -0500, Justin Pryzby wrote:

See 0001 and the manpage.

+ 'pg_dump: compression is not supported by tar archive format');

When I submitted a patch to support zstd, I spent awhile trying to make
compression work with tar, but it's a significant effort and better done
separately.

Wow. This stuff is old enough to vote (c3e18804), dead since its
introduction. There is indeed an argument for removing that, it is
not good to keep around that that has never been stressed and/or
used. Upon review, the cleanup done looks correct, as we have never
been able to generate .dat.gz files in for a dump in the tar format.

+   command_fails_like(
+       [ 'pg_dump', '--compress', '1', '--format', 'tar' ],
This addition depending on HAVE_LIBZ is a good thing as a reminder of
any work that could be done in 0002.  Now that's waiting for 20 years
so I would not hold my breath on this support.  I think that this
could be just applied first, with 0002 on top of it, as a first
improvement.
+       compress_cmd => [
+           $ENV{'GZIP_PROGRAM'},
Patch 0001 is missing and update of pg_dump's Makefile to pass down
this environment variable to the test scripts, no?
+       compress_cmd => [
+           $ENV{'GZIP_PROGRAM'},
+           '-f',
[...]
+           $ENV{'GZIP_PROGRAM'},
+           '-k', '-d',
-f and -d are available everywhere I looked at, but is -k/--keep a
portable choice with a gzip command?  I don't see this option in
OpenBSD, for one.  So this test is going to cause problems on those
buildfarm machines, at least.  Couldn't this part be replaced by a
simple --test to check that what has been compressed is in correct
shape?  We know that this works, based on our recent experiences with
the other tests.
--
Michael
#20Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: Michael Paquier (#19)
Re: Add LZ4 compression in pg_dump

------- Original Message -------

On Tuesday, March 29th, 2022 at 9:27 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Mar 26, 2022 at 01:14:41AM -0500, Justin Pryzby wrote:

See 0001 and the manpage.
+ 'pg_dump: compression is not supported by tar archive format');
When I submitted a patch to support zstd, I spent awhile trying to make
compression work with tar, but it's a significant effort and better done
separately.

Wow. This stuff is old enough to vote (c3e18804), dead since its
introduction. There is indeed an argument for removing that, it is
not good to keep around that that has never been stressed and/or
used. Upon review, the cleanup done looks correct, as we have never
been able to generate .dat.gz files in for a dump in the tar format.

Correct. My driving force behind it was to ease up the cleanup/refactoring
work that follows, by eliminating the callers of the GZ*() macros.

+ command_fails_like(

+ [ 'pg_dump', '--compress', '1', '--format', 'tar' ],
This addition depending on HAVE_LIBZ is a good thing as a reminder of
any work that could be done in 0002. Now that's waiting for 20 years
so I would not hold my breath on this support. I think that this
could be just applied first, with 0002 on top of it, as a first
improvement.

Excellent, thank you.

+ compress_cmd => [
+ $ENV{'GZIP_PROGRAM'},
Patch 0001 is missing and update of pg_dump's Makefile to pass down
this environment variable to the test scripts, no?

Agreed. It was not properly moved forward. Fixed.

+ compress_cmd => [
+ $ENV{'GZIP_PROGRAM'},
+ '-f',
[...]
+ $ENV{'GZIP_PROGRAM'},
+ '-k', '-d',
-f and -d are available everywhere I looked at, but is -k/--keep a
portable choice with a gzip command? I don't see this option in
OpenBSD, for one. So this test is going to cause problems on those
buildfarm machines, at least. Couldn't this part be replaced by a
simple --test to check that what has been compressed is in correct
shape? We know that this works, based on our recent experiences with
the other tests.

I would argue that the simple '--test' will not do in this case, as the
TAP tests do need a file named <test>.sql to compare the contents with.
This file is generated either directly by pg_dump itself, or by running
pg_restore on pg_dump's output. In the case of compression pg_dump will
generate a <test>.sql.<compression program suffix> which can not be
used in the comparison tests. So the intention of this block, is not to
simply test for validity, but to also decompress pg_dump's output for it
to be able to be used.

I updated the patch to simply remove the '-k' flag.

Please find v3 attached. (only 0001 and 0002 are relevant, 0003 and
0004 are only for reference and are currently under active modification).

Cheers,
//Georgios

Attachments:

v3-0004-Add-LZ4-compression-in-pg_-dump-restore.patchtext/x-patch; name=v3-0004-Add-LZ4-compression-in-pg_-dump-restore.patchDownload+911-150
v3-0003-Prepare-pg_dump-for-additional-compression-method.patchtext/x-patch; name=v3-0003-Prepare-pg_dump-for-additional-compression-method.patchDownload+608-358
v3-0001-Extend-compression-coverage-for-pg_dump-pg_restor.patchtext/x-patch; name=v3-0001-Extend-compression-coverage-for-pg_dump-pg_restor.patchDownload+109-1
v3-0002-Remove-unsupported-bitrot-compression-from-pg_dum.patchtext/x-patch; name=v3-0002-Remove-unsupported-bitrot-compression-from-pg_dum.patchDownload+9-74
#21Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#18)
#22Michael Paquier
michael@paquier.xyz
In reply to: Georgios Kokolatos (#20)
#23Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#21)
#24Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: Michael Paquier (#22)
#25Michael Paquier
michael@paquier.xyz
In reply to: Georgios Kokolatos (#24)
#26Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: Michael Paquier (#25)
#27Michael Paquier
michael@paquier.xyz
In reply to: Georgios Kokolatos (#26)
#28Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: Michael Paquier (#27)
#29Michael Paquier
michael@paquier.xyz
In reply to: Georgios Kokolatos (#28)
#30Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: Michael Paquier (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#27)
#32Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#31)