Refactoring of compression options in pg_basebackup

Started by Michael Paquierover 4 years ago42 messageshackers
Jump to latest
#1Michael Paquier
michael@paquier.xyz

Hi all,
(Added Georgios in CC.)

When working on the support of LZ4 for pg_receivewal, walmethods.c has
gained an extra parameter for the compression method. It gets used on
the DirectoryMethod instead of the compression level to decide which
type of compression is used. One thing that I left out during this
previous work is that the TarMethod also gained knowledge of this
compression method, but we still use the compression level to check if
tars should be compressed or not.

This is wrong on multiple aspects. First, this is not consistent with
the directory method, making walmethods.c harder to figure out.
Second, this is not extensible if we want to introduce more
compression methods in pg_basebackup, like LZ4. This reflects on the
options used by pg_receivewal and pg_basebackup, that are not
inconsistent as well.

The attached patch refactors the code of pg_basebackup and the
TarMethod of walmethods.c to use the compression method where it
should, splitting entirely the logic related the compression level.

This is one step toward the introduction of LZ4 in pg_basebackup, but
this refactoring is worth doing on its own, hence a separate thread to
deal with this problem first. The options of pg_basebackup are
reworked to be consistent with pg_receivewal, as follows:
- --compress ranges now from 1 to 9, instead of 0 to 9.
- --compression-method={none,gzip} is added, the default is none, same
as HEAD.
- --gzip/-z has the same meaning as before, being just a synonym of
--compression-method=gzip with the default compression level of ZLIB
assigned if there is no --compress.

One more thing that I have noticed while hacking this stuff is that we
have no regression tests for gzip with pg_basebackup, so I have added
some that are skipped when not compiling the code with ZLIB.

Opinions?
--
Michael

Attachments:

0001-Refactor-options-of-pg_basebackup-for-compression-le.patchtext/x-diff; charset=us-asciiDownload+155-41
#2Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#1)
Re: Refactoring of compression options in pg_basebackup

On Sat, Dec 18, 2021 at 6:29 AM Michael Paquier <michael@paquier.xyz> wrote:

This is one step toward the introduction of LZ4 in pg_basebackup, but
this refactoring is worth doing on its own, hence a separate thread to
deal with this problem first. The options of pg_basebackup are
reworked to be consistent with pg_receivewal, as follows:
- --compress ranges now from 1 to 9, instead of 0 to 9.
- --compression-method={none,gzip} is added, the default is none, same
as HEAD.
- --gzip/-z has the same meaning as before, being just a synonym of
--compression-method=gzip with the default compression level of ZLIB
assigned if there is no --compress.

One thing we should keep in mind is that I'm also working on adding
server-side compression, initially with gzip, but Jeevan Ladhe has
posted patches to extend that to LZ4. So however we structure the
options they should take that into account. Currently that patch set
adds --server-compression={none,gzip,gzipN} where N from 1 to 9, but
perhaps it should be done differently. Not sure.

One more thing that I have noticed while hacking this stuff is that we
have no regression tests for gzip with pg_basebackup, so I have added
some that are skipped when not compiling the code with ZLIB.

If they don't decompress the backup and run pg_verifybackup on it then
I'm not sure how much they help. Yet, I don't know how to do that
portably.

--
Robert Haas
EDB: http://www.enterprisedb.com

#3Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#2)
Re: Refactoring of compression options in pg_basebackup

On Mon, Dec 20, 2021 at 10:19:44AM -0500, Robert Haas wrote:

One thing we should keep in mind is that I'm also working on adding
server-side compression, initially with gzip, but Jeevan Ladhe has
posted patches to extend that to LZ4. So however we structure the
options they should take that into account. Currently that patch set
adds --server-compression={none,gzip,gzipN} where N from 1 to 9, but
perhaps it should be done differently. Not sure.

Yeah, consistency would be good. For the client-side compression of
LZ4, we have shaped things around the existing --compress option, and
there is 6f164e6 that offers an API to parse that at option-level,
meaning less custom error strings. I'd like to think that splitting
the compression level and the compression method is still the right
choice, except if --server-compression combined with a client-side
compression is a legal combination. This would not really make sense,
though, no? So we'd better block this possibility from the start?

One more thing that I have noticed while hacking this stuff is that we
have no regression tests for gzip with pg_basebackup, so I have added
some that are skipped when not compiling the code with ZLIB.

If they don't decompress the backup and run pg_verifybackup on it then
I'm not sure how much they help. Yet, I don't know how to do that
portably.

They help in checking that an environment does not use a buggy set of
GZIP, at least. Using pg_verifybackup on a base backup with
--format='t' could be tweaked with $ENV{TAR} for the tarballs
generation, for example, as we do in some other tests. Option sets
like "xvf" or "zxvf" should be rather portable across the buildfarm,
no? I'd like to think that this is not a requirement for adding
checks in the compression path, as a first step, though, but I agree
that it could be extended more.
--
Michael

#4Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: Michael Paquier (#1)
Re: Refactoring of compression options in pg_basebackup

Hi,

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Saturday, December 18th, 2021 at 12:29 PM, Michael Paquier
<michael@paquier.xyz> wrote:

Hi all,
(Added Georgios in CC.)

thank you for the shout out.

When working on the support of LZ4 for pg_receivewal, walmethods.c has
gained an extra parameter for the compression method. It gets used on
the DirectoryMethod instead of the compression level to decide which
type of compression is used. One thing that I left out during this
previous work is that the TarMethod also gained knowledge of this
compression method, but we still use the compression level to check if
tars should be compressed or not.

This is wrong on multiple aspects. First, this is not consistent with
the directory method, making walmethods.c harder to figure out.
Second, this is not extensible if we want to introduce more
compression methods in pg_basebackup, like LZ4. This reflects on the
options used by pg_receivewal and pg_basebackup, that are not
inconsistent as well.

Agreed with all the above.

The attached patch refactors the code of pg_basebackup and the
TarMethod of walmethods.c to use the compression method where it
should, splitting entirely the logic related the compression level.

Thanks.

This is one step toward the introduction of LZ4 in pg_basebackup, but
this refactoring is worth doing on its own, hence a separate thread to
deal with this problem first. The options of pg_basebackup are
reworked to be consistent with pg_receivewal, as follows:
- --compress ranges now from 1 to 9, instead of 0 to 9.
- --compression-method={none,gzip} is added, the default is none, same
as HEAD.
- --gzip/-z has the same meaning as before, being just a synonym of
--compression-method=gzip with the default compression level of ZLIB
assigned if there is no --compress.

Indeed this is consistent with pg_receivewal. It gets my +1.

One more thing that I have noticed while hacking this stuff is that we
have no regression tests for gzip with pg_basebackup, so I have added
some that are skipped when not compiling the code with ZLIB.

As far as the code is concerned, I have a minor nitpick.

+       if (compression_method == COMPRESSION_NONE)
+           streamer = bbstreamer_plain_writer_new(archive_filename,
+                                                  archive_file);
 #ifdef HAVE_LIBZ
-       if (compresslevel != 0)
+       else if (compression_method == COMPRESSION_GZIP)
        {
            strlcat(archive_filename, ".gz", sizeof(archive_filename));
            streamer = bbstreamer_gzip_writer_new(archive_filename,
                                                  archive_file,
                                                  compresslevel);
        }
-       else
 #endif
-           streamer = bbstreamer_plain_writer_new(archive_filename,
-                                                  archive_file);
-
+       else
+       {
+           Assert(false);      /* not reachable */
+       }

The above block moves the initialization of 'streamer' within two conditional
blocks. Despite this being correct, it is possible that some compilers will
complain for lack of initialization of 'streamer' when it is eventually used a
bit further ahead in:
if (must_parse_archive)
streamer = bbstreamer_tar_archiver_new(streamer);

I propose to initialize streamer to NULL when declared at the top of
CreateBackupStreamer().

As far as the tests are concerned, I think that 2 too many tests are skipped
when HAVE_LIBZ is not defined to be 1. The patch reads:

+Check ZLIB compression if available.
+SKIP:
+{
+   skip "postgres was not built with ZLIB support", 5
+     if (!check_pg_config("#define HAVE_LIBZ 1"));
+
+   $node->command_ok(
+       [
+           'pg_basebackup',        '-D',
+           "$tempdir/backup_gzip", '--compression-method',
+           'gzip', '--compress', '1', '--no-sync', '--format', 't'
+       ],
+       'pg_basebackup with --compress and --compression-method=gzip');
+
+   # Verify that the stored files are generated with their expected
+   # names.
+   my @zlib_files = glob "$tempdir/backup_gzip/*.tar.gz";
+   is(scalar(@zlib_files), 2,
+       "two files created with gzip (base.tar.gz and pg_wal.tar.gz)");
+
+   # Check the integrity of the files generated.
+   my $gzip = $ENV{GZIP_PROGRAM};
+   skip "program gzip is not found in your system", 1
+     if ( !defined $gzip
+       || $gzip eq ''
+       || system_log($gzip, '--version') != 0);
+
+   my $gzip_is_valid = system_log($gzip, '--test', @zlib_files);
+   is($gzip_is_valid, 0, "gzip verified the integrity of compressed data");
+   rmtree("$tempdir/backup_gzip");
+}

You can see that after the check_pg_config() test, only 3 tests follow,
namely:
* $node->command_ok()
* is(scalar(), ...)
* is($gzip_is_valid, ...)

Opinions?

Other than the minor issues above, I think this is a solid improvement. +1

--
Michael

Cheers,
//Georgios

#5Michael Paquier
michael@paquier.xyz
In reply to: Georgios Kokolatos (#4)
Re: Refactoring of compression options in pg_basebackup

(Added Jeevan in CC, for awareness)

On Mon, Jan 03, 2022 at 03:35:57PM +0000, gkokolatos@pm.me wrote:

I propose to initialize streamer to NULL when declared at the top of
CreateBackupStreamer().

Yes, that may be noisy. Done this way.

You can see that after the check_pg_config() test, only 3 tests follow,
namely:
* $node->command_ok()
* is(scalar(), ...)
* is($gzip_is_valid, ...)

Indeed. The CF bot was complaining about that, actually.

Thinking more about this stuff, pg_basebackup --compress is an option
that exists already for a couple of years, and that's independent of
the backend-side compression that Robert and Jeevan are working on, so
I'd like to move on this code cleanup. We can always figure out the
LZ4 part for pg_basebackup after, if necessary.

Attached is an updated patch. The CF bot should improve with that.
--
Michael

Attachments:

v2-0001-Refactor-options-of-pg_basebackup-for-compression.patchtext/x-diff; charset=us-asciiDownload+156-42
#6Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: Michael Paquier (#5)
Re: Refactoring of compression options in pg_basebackup

On Wednesday, January 5th, 2022 at 9:00 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Jan 03, 2022 at 03:35:57PM +0000, gkokolatos@pm.me wrote:

I propose to initialize streamer to NULL when declared at the top of
CreateBackupStreamer().

Yes, that may be noisy. Done this way.

Great.

You can see that after the check_pg_config() test, only 3 tests follow,
namely:
* $node->command_ok()
* is(scalar(), ...)
* is($gzip_is_valid, ...)

Indeed. The CF bot was complaining about that, actually.

Great.

Thinking more about this stuff, pg_basebackup --compress is an option
that exists already for a couple of years, and that's independent of
the backend-side compression that Robert and Jeevan are working on, so
I'd like to move on this code cleanup. We can always figure out the
LZ4 part for pg_basebackup after, if necessary.

I agree that the cleanup in itself is helpful. It feels awkward to have two
utilities under the same path, with distinct options for the same
functionality.

When the backend-side compression is completed, were there really be a need for
client-side compression? If yes, then it seems logical to have distinct options
for them and this cleanup makes sense. If not, then it seems logical to maintain
the current options list and 'simply' change the internals of the code, and this
cleanup makes sense.

Attached is an updated patch. The CF bot should improve with that.

+1

--
Michael

Cheers,
//Georgios

#7Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#3)
Re: Refactoring of compression options in pg_basebackup

On Mon, Dec 20, 2021 at 7:43 PM Michael Paquier <michael@paquier.xyz> wrote:

Yeah, consistency would be good. For the client-side compression of
LZ4, we have shaped things around the existing --compress option, and
there is 6f164e6 that offers an API to parse that at option-level,
meaning less custom error strings. I'd like to think that splitting
the compression level and the compression method is still the right
choice, except if --server-compression combined with a client-side
compression is a legal combination. This would not really make sense,
though, no? So we'd better block this possibility from the start?

Right. It's blocked right now, but Tushar noticed on the other thread
that the error message isn't as good as it could be, so I'll improve
that a bit. Still the issue wasn't overlooked.

If they don't decompress the backup and run pg_verifybackup on it then
I'm not sure how much they help. Yet, I don't know how to do that
portably.

They help in checking that an environment does not use a buggy set of
GZIP, at least. Using pg_verifybackup on a base backup with
--format='t' could be tweaked with $ENV{TAR} for the tarballs
generation, for example, as we do in some other tests. Option sets
like "xvf" or "zxvf" should be rather portable across the buildfarm,
no? I'd like to think that this is not a requirement for adding
checks in the compression path, as a first step, though, but I agree
that it could be extended more.

Oh, well, if we have a working tar available, then it's not so bad. I
was thinking we couldn't really count on that, especially on Windows.

--
Robert Haas
EDB: http://www.enterprisedb.com

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: Refactoring of compression options in pg_basebackup

Robert Haas <robertmhaas@gmail.com> writes:

Oh, well, if we have a working tar available, then it's not so bad. I
was thinking we couldn't really count on that, especially on Windows.

I think the existing precedent is to skip the test if tar isn't there,
cf pg_basebackup/t/010_pg_basebackup.pl. But certainly the majority of
buildfarm animals have it.

regards, tom lane

#9Robert Haas
robertmhaas@gmail.com
In reply to: Georgios Kokolatos (#6)
Re: Refactoring of compression options in pg_basebackup

On Wed, Jan 5, 2022 at 4:17 AM <gkokolatos@pm.me> wrote:

When the backend-side compression is completed, were there really be a need for
client-side compression? If yes, then it seems logical to have distinct options
for them and this cleanup makes sense. If not, then it seems logical to maintain
the current options list and 'simply' change the internals of the code, and this
cleanup makes sense.

I think we're going to want to offer both options. We can't know
whether the user prefers to consume CPU cycles on the server or on the
client. Compressing on the server has the advantage of potentially
saving transfer bandwidth, but the server is also often the busiest
part of the whole system, and users are often keen to offload as much
work as possible.

Given that, I'd like us to be thinking about what the full set of
options looks like once we have (1) compression on either the server
or the client and (2) multiple compression algorithms and (3) multiple
compression levels. Personally, I don't really like the decision made
by this proposed patch. In this patch's view of the world, -Z is a way
of providing the compression level for whatever compression algorithm
you happen to have selected, but I think of -Z as being the upper-case
version of -z which I think of as selecting specifically gzip. It's
not particularly intuitive to me that in a command like pg_basebackup
--compress=<something>, <something> is a compression level rather than
an algorithm. So what I would propose is probably something like:

pg_basebackup --compress=ALGORITHM [--compression-level=NUMBER]
pg_basebackup --server-compress=ALGORITHM [--compression-level=NUMBER]

And then make -z short for --compress=gzip and -Z <n> short for
--compress=gzip --compression-level=<n>. That would be a
backward-incompatible change to the definition of --compress, but as
long as -Z <n> works the same as today, I don't think many people will
notice. If we like, we can notice if the argument to --compress is an
integer and suggest using either -Z or --compress=gzip
--compression-level=<n> instead.

In the proposed patch, you end up with pg_basebackup
--compression-method=lz4 -Z2 meaning compression with lz4 level 2. I
find that quite odd, though as with all such things, opinions may
vary. In my proposal, that would be an error, because it would be
equivalent to --compress=lz4 --compress=gzip --compression-level=2,
and would thus involve conflicting compression method specifications.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#9)
Re: Refactoring of compression options in pg_basebackup

On Wed, Jan 05, 2022 at 10:33:38AM -0500, Robert Haas wrote:

I think we're going to want to offer both options. We can't know
whether the user prefers to consume CPU cycles on the server or on the
client. Compressing on the server has the advantage of potentially
saving transfer bandwidth, but the server is also often the busiest
part of the whole system, and users are often keen to offload as much
work as possible.

Yeah. There are cases for both. I just got to wonder whether it
makes sense to allow both server-side and client-side compression to
be used at the same time. That would be a rather strange case, but
well, with the correct set of options that could be possible.

Given that, I'd like us to be thinking about what the full set of
options looks like once we have (1) compression on either the server
or the client and (2) multiple compression algorithms and (3) multiple
compression levels. Personally, I don't really like the decision made
by this proposed patch. In this patch's view of the world, -Z is a way
of providing the compression level for whatever compression algorithm
you happen to have selected, but I think of -Z as being the upper-case
version of -z which I think of as selecting specifically gzip. It's
not particularly intuitive to me that in a command like pg_basebackup
--compress=<something>, <something> is a compression level rather than
an algorithm. So what I would propose is probably something like:

pg_basebackup --compress=ALGORITHM [--compression-level=NUMBER]
pg_basebackup --server-compress=ALGORITHM [--compression-level=NUMBER]

And then make -z short for --compress=gzip and -Z <n> short for
--compress=gzip --compression-level=<n>. That would be a
backward-incompatible change to the definition of --compress, but as
long as -Z <n> works the same as today, I don't think many people will
notice. If we like, we can notice if the argument to --compress is an
integer and suggest using either -Z or --compress=gzip
--compression-level=<n> instead.

My view of things is slightly different, aka I'd rather keep
--compress to mean a compression level with an integer option, but
introduce a --compression-method={lz4,gzip,none}, with -Z being a
synonym of --compression-method=gzip. That's at least the path we
chose for pg_receivewal. I don't mind sticking with one way or
another, as what you are proposing is basically the same thing I have
in mind, but both tools ought to use the same set of options.

Hmm. Perhaps at the end the problem is with --compress, where we
don't know if it means a compression level or a compression method?
For me, --compress means the former, and for you the latter. So a
third way of seeing things is to drop completely --compress, but have
one --compression-method and one --compression-level. That would
bring a clear split. Or just one --compression-method for the
client-side compression as you are proposing for the server-side
compression, however I'd like to think that a split between the method
and level is more intuitive.

In the proposed patch, you end up with pg_basebackup
--compression-method=lz4 -Z2 meaning compression with lz4 level 2. I
find that quite odd, though as with all such things, opinions may
vary. In my proposal, that would be an error, because it would be
equivalent to --compress=lz4 --compress=gzip --compression-level=2,
and would thus involve conflicting compression method specifications.

It seems to me that you did not read the patch closely enough. The
attached patch does not add support for LZ4 in pg_basebackup on the
client-side yet. Once it gets added, though, the idea is that using
--compress with LZ4 would result in an error. That's what happens
with pg_receivewal on HEAD, for one. The patch just shapes things to
plug LZ4 more easily in the existing code of pg_basebackup.c, and
walmethods.c.

So.. As of now, it is actually possible to cut the pie in three
parts. There are no real objections to the cleanup of walmethods.c
and the addition of some conditional TAP tests with pg_basebackup and
client-side compression, as far as I can see, only to the option
renaming part. Attached are two patches, then. 0001 is the cleanup
of walmethods.c to rely the compression method, with more tests (tests
that could be moved into their own patch, as well). 0002 is the
addition of the options I suggested upthread, but we may change that
depending on what gets used for the server-side compression for
consistency so I am not suggesting to merge that until we agree on the
full picture. The point of this thread was mostly about 0001, so I am
fine to discard 0002. Thoughts?
--
Michael

Attachments:

v3-0001-Refactor-walmethods.c-s-tar-method-to-use-compres.patchtext/x-diff; charset=us-asciiDownload+67-20
v3-0002-Rework-the-option-set-of-pg_basebackup.patchtext/x-diff; charset=us-asciiDownload+95-29
#11Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#8)
Re: Refactoring of compression options in pg_basebackup

On Wed, Jan 05, 2022 at 10:22:06AM -0500, Tom Lane wrote:

I think the existing precedent is to skip the test if tar isn't there,
cf pg_basebackup/t/010_pg_basebackup.pl. But certainly the majority of
buildfarm animals have it.

Even Windows environments should be fine, aka recent edc2332.
--
Michael

#12Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#10)
Re: Refactoring of compression options in pg_basebackup

On Thu, Jan 6, 2022 at 12:04 AM Michael Paquier <michael@paquier.xyz> wrote:

Yeah. There are cases for both. I just got to wonder whether it
makes sense to allow both server-side and client-side compression to
be used at the same time. That would be a rather strange case, but
well, with the correct set of options that could be possible.

I don't think it makes sense to support that. On the one hand, it
doesn't seem useful: compressing already-compressed data doesn't
usually work very well. Alternatively, I suppose the intent could be
to compress one way for transfer and then decompress and recompress
for storage, but that seems too inefficient to take seriously. On the
other hand, it requires a more complex user interface, and it's
already fairly complicated anyway.

My view of things is slightly different, aka I'd rather keep
--compress to mean a compression level with an integer option, but
introduce a --compression-method={lz4,gzip,none}, with -Z being a
synonym of --compression-method=gzip. That's at least the path we
chose for pg_receivewal. I don't mind sticking with one way or
another, as what you are proposing is basically the same thing I have
in mind, but both tools ought to use the same set of options.

Did you mean that -z would be a synonym for --compression-method=gzip?
It doesn't really make sense for -Z to be that, unless it's also
setting the compression level.

My objection to --compress=$LEVEL is that the compression level seems
like it ought to rightfully be subordinate to the choice of algorithm.
In general, there's no reason why a compression algorithm has to offer
a choice of compression levels at all, or why they have to be numbered
0 through 9. For example, lz4 on my system claims to offer compression
levels from 1 through 12, plus a separate set of "fast" compression
levels starting with 1 and going up to an unspecified number. And then
it also has options to favor decompression speed, change the block
size, and a few other parameters. We don't necessarily want to expose
all of those options, but we should structure things so that we could
if it became important. The way to do that is to make the compression
algorithm the primary setting, and then anything else you can set for
that compressor is somehow a subordinate setting.

Put another way, we don't decide first that we want to compress with
level 7, and then afterward decide whether that's gzip, lz4, or bzip2.
We pick the compressor first, and then MAYBE think about changing the
compression level.

In the proposed patch, you end up with pg_basebackup
--compression-method=lz4 -Z2 meaning compression with lz4 level 2. I
find that quite odd, though as with all such things, opinions may
vary. In my proposal, that would be an error, because it would be
equivalent to --compress=lz4 --compress=gzip --compression-level=2,
and would thus involve conflicting compression method specifications.

It seems to me that you did not read the patch closely enough. The
attached patch does not add support for LZ4 in pg_basebackup on the
client-side yet. Once it gets added, though, the idea is that using
--compress with LZ4 would result in an error. That's what happens
with pg_receivewal on HEAD, for one. The patch just shapes things to
plug LZ4 more easily in the existing code of pg_basebackup.c, and
walmethods.c.

Well what I was looking at was this:

- printf(_("  -Z, --compress=0-9     compress tar output with given
compression level\n"));
+ printf(_("  -Z, --compress=1-9     compress tar output with given
compression level\n"));
+ printf(_("      --compression-method=METHOD\n"
+ "                         method to compress data\n"));

That seems to show that, post-patch, the argument to -Z would be a
compression level, even if --compression-method were something other
than gzip.

It's possible that I haven't read something carefully enough, but to
me, what I said seems to be a straightforward conclusion based on
looking at the usage help in the patch. So if I came to the wrong
conclusion, perhaps that usage help isn't reflecting the situation you
intend to create, or not as clearly as it ought.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#12)
Re: Refactoring of compression options in pg_basebackup

On Thu, Jan 06, 2022 at 09:27:19AM -0500, Robert Haas wrote:

Did you mean that -z would be a synonym for --compression-method=gzip?
It doesn't really make sense for -Z to be that, unless it's also
setting the compression level.

Yes, I meant "-z", not "-Z", to be a synonym of
--compression-method=gzip. Sorry for the typo.

My objection to --compress=$LEVEL is that the compression level seems
like it ought to rightfully be subordinate to the choice of algorithm.
In general, there's no reason why a compression algorithm has to offer
a choice of compression levels at all, or why they have to be numbered
0 through 9. For example, lz4 on my system claims to offer compression
levels from 1 through 12, plus a separate set of "fast" compression
levels starting with 1 and going up to an unspecified number. And then
it also has options to favor decompression speed, change the block
size, and a few other parameters. We don't necessarily want to expose
all of those options, but we should structure things so that we could
if it became important. The way to do that is to make the compression
algorithm the primary setting, and then anything else you can set for
that compressor is somehow a subordinate setting.

For any compression method, that maps to an integer, so.. But I am
not going to fight hard on that.

Put another way, we don't decide first that we want to compress with
level 7, and then afterward decide whether that's gzip, lz4, or bzip2.
We pick the compressor first, and then MAYBE think about changing the
compression level.

Which is why things should be checked once all the options are
processed. I'd recommend that you read the option patch a bit more,
that may help.

Well what I was looking at was this:

- printf(_("  -Z, --compress=0-9     compress tar output with given
compression level\n"));
+ printf(_("  -Z, --compress=1-9     compress tar output with given
compression level\n"));
+ printf(_("      --compression-method=METHOD\n"
+ "                         method to compress data\n"));

That seems to show that, post-patch, the argument to -Z would be a
compression level, even if --compression-method were something other
than gzip.

Yes, after the patch --compress would be a compression level. And, if
attempting to use with --compression-method set to "none", or
potentially "lz4", it would just fail. If not using this "gzip", the
compression level is switched to Z_DEFAULT_COMPRESSION. That's this
area of the patch, FWIW:
+ /*
+ * Compression-related options.
+ */
+ switch (compression_method)

It's possible that I haven't read something carefully enough, but to
me, what I said seems to be a straightforward conclusion based on
looking at the usage help in the patch. So if I came to the wrong
conclusion, perhaps that usage help isn't reflecting the situation you
intend to create, or not as clearly as it ought.

Perhaps the --help output could be clearer, then. Do you have a
suggestion?

Bringing walmethods.c at the same page for the directory and the tar
methods was my primary goal here, and the tests are a bonus, so I've
applied this part for now, leaving pg_basebackup alone until we figure
out the layer of options we should use. Perhaps it would be better to
revisit that stuff once the server-side compression has landed.
--
Michael

#14Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#13)
Re: Refactoring of compression options in pg_basebackup

On Fri, Jan 7, 2022 at 1:43 AM Michael Paquier <michael@paquier.xyz> wrote:

Which is why things should be checked once all the options are
processed. I'd recommend that you read the option patch a bit more,
that may help.

I don't think that the problem here is my lack of understanding. I
have two basic concerns about your proposed patch:

1. If, as you propose, we add a new flag --compression-method=METHOD
then how will the user specify server-side compression?
2. If, as we seem to agree, the compression method is more important
than the compression level, then why is the option to set the
less-important thing called just --compress, and the option to set the
more important thing has a longer name?

I proposed to solve both of these problems by using
--compression-level=NUMBER to set the compression level and
--compress=METHOD or --server-compress=METHOD to set the algorithm and
specify on which side it is to be applied. If, instead of doing that,
we go with what you have proposed here, then I don't understand how to
fit server-side compression into the framework in a reasonably concise
way. I think we would end up with something like pg_basebackup
--compression-method=lz4 --compress-on-server, which seems rather long
and awkward. Do you have a better idea?

I think I understand what the patch is doing. I just think it creates
a problem for my patch. And I'd like to know whether you have an idea
how to solve that problem. And if not, then I'd like you to consider
the solution that I am proposing rather than the patch you've already
got.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#14)
Re: Refactoring of compression options in pg_basebackup

On Thu, Jan 13, 2022 at 01:36:00PM -0500, Robert Haas wrote:

1. If, as you propose, we add a new flag --compression-method=METHOD
then how will the user specify server-side compression?

This would require a completely different option switch, which is
basically the same thing as what you are suggesting with
--server-compress.

2. If, as we seem to agree, the compression method is more important
than the compression level, then why is the option to set the
less-important thing called just --compress, and the option to set the
more important thing has a longer name?

I agree that the method is more important than the level for most
users, and I would not mind dropping completely --compress in favor of
something else, which is something I implied upthread.

I proposed to solve both of these problems by using
--compression-level=NUMBER to set the compression level and
--compress=METHOD or --server-compress=METHOD to set the algorithm and
specify on which side it is to be applied. If, instead of doing that,
we go with what you have proposed here, then I don't understand how to
fit server-side compression into the framework in a reasonably concise
way. I think we would end up with something like pg_basebackup
--compression-method=lz4 --compress-on-server, which seems rather long
and awkward. Do you have a better idea?

Using --compression-level=NUMBER and --server-compress=METHOD to
specify a server-side compression method with a level is fine by me,
but I find the reuse of --compress to specify a compression method
confusing as it maps with the past option we have kept in
pg_basebackup for a couple of years now. Based on your suggested set
of options, we could then have a --client-compress=METHOD and
--compression-level=NUMBER to specify a client-side compression method
with a level. If we do that, I guess that we should then:
1) Block the combination of --server-compress and --client-compress.
2) Remove the existing -Z/--compress and -z/--gzip.

You have implied 1) upthread as far as I recall, 2) is something I am
adding on top of it.

I think I understand what the patch is doing. I just think it creates
a problem for my patch. And I'd like to know whether you have an idea
how to solve that problem. And if not, then I'd like you to consider
the solution that I am proposing rather than the patch you've already
got.

I am fine to drop this thread's patch with its set of options and work
on top of your proposal, aka what's drafted two paragraphs above.
--
Michael

#16Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#15)
Re: Refactoring of compression options in pg_basebackup

On Thu, Jan 13, 2022 at 10:23 PM Michael Paquier <michael@paquier.xyz> wrote:

Using --compression-level=NUMBER and --server-compress=METHOD to
specify a server-side compression method with a level is fine by me,
but I find the reuse of --compress to specify a compression method
confusing as it maps with the past option we have kept in
pg_basebackup for a couple of years now. Based on your suggested set
of options, we could then have a --client-compress=METHOD and
--compression-level=NUMBER to specify a client-side compression method
with a level. If we do that, I guess that we should then:
1) Block the combination of --server-compress and --client-compress.
2) Remove the existing -Z/--compress and -z/--gzip.

I could live with that. I'm not sure that --client-compress instead of
reusing --compress is going to be better ... but I don't think it's
awful so much as just not my first choice. I also don't think it would
be horrid to leave -z, --gzip, and -Z as shorthands for the
--client-compress=gzip with --compression-level also in the last case,
instead of removing all that stuff.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#16)
Re: Refactoring of compression options in pg_basebackup

On Fri, Jan 14, 2022 at 04:53:12PM -0500, Robert Haas wrote:

On Thu, Jan 13, 2022 at 10:23 PM Michael Paquier <michael@paquier.xyz> wrote:

Using --compression-level=NUMBER and --server-compress=METHOD to
specify a server-side compression method with a level is fine by me,
but I find the reuse of --compress to specify a compression method
confusing as it maps with the past option we have kept in
pg_basebackup for a couple of years now. Based on your suggested set
of options, we could then have a --client-compress=METHOD and
--compression-level=NUMBER to specify a client-side compression method
with a level. If we do that, I guess that we should then:
1) Block the combination of --server-compress and --client-compress.
2) Remove the existing -Z/--compress and -z/--gzip.

I could live with that. I'm not sure that --client-compress instead of
reusing --compress is going to be better ... but I don't think it's
awful so much as just not my first choice. I also don't think it would
be horrid to leave -z, --gzip, and -Z as shorthands for the
--client-compress=gzip with --compression-level also in the last case,
instead of removing all that stuff.

Okay. So, based on this feedback, I guess that something like the
attached would be what we are looking for. I have maximized the
amount of code removed with the removal of -z/-Z, but I won't fight
hard if the consensus is to keep them, either. We could also keep
-z/--gzip, and stick -Z to the new --compression-level with
--compress removed.
--
Michael

Attachments:

v4-0001-Rework-the-option-set-of-pg_basebackup.patchtext/x-diff; charset=us-asciiDownload+105-65
#18Magnus Hagander
magnus@hagander.net
In reply to: Robert Haas (#16)
Re: Refactoring of compression options in pg_basebackup

On Fri, Jan 14, 2022 at 10:53 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 13, 2022 at 10:23 PM Michael Paquier <michael@paquier.xyz> wrote:

Using --compression-level=NUMBER and --server-compress=METHOD to
specify a server-side compression method with a level is fine by me,
but I find the reuse of --compress to specify a compression method
confusing as it maps with the past option we have kept in
pg_basebackup for a couple of years now. Based on your suggested set
of options, we could then have a --client-compress=METHOD and
--compression-level=NUMBER to specify a client-side compression method
with a level. If we do that, I guess that we should then:
1) Block the combination of --server-compress and --client-compress.
2) Remove the existing -Z/--compress and -z/--gzip.

I could live with that. I'm not sure that --client-compress instead of
reusing --compress is going to be better ... but I don't think it's
awful so much as just not my first choice. I also don't think it would
be horrid to leave -z, --gzip, and -Z as shorthands for the
--client-compress=gzip with --compression-level also in the last case,
instead of removing all that stuff.

It never makes sense to compress *both* in server and client, right?

One argument in that case for using --compress would be that we could
have that one take options like --compress=gzip (use gzip in the
client) and --compress=server-lz4 (use lz4 on the server), and
automatically make it impossible to do both. And maybe also accept
--compress=client-gzip (which would be the same as just specifying
gzip).

That would be an argument for actually keeping --compress and not
using --client-compress, because obviously it would be silly to have
--client-compress=server-lz4...

And yes, I agree that considering both server and client compression
even if we don't have server compression makes sense, since we don't
want to change things around again when we get it.

We could perhaps also consider accepting --compress=gzip:7
(<method>:<level>) as a way to specify the level, for both client and
server side.

I think having --client-compress and --server-compress separately but
having --compression-level *not* being separate would be confusing and
I *think* that's what the current patch proposes?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#19Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#18)
Re: Refactoring of compression options in pg_basebackup

On Sat, Jan 15, 2022 at 04:15:26PM +0100, Magnus Hagander wrote:

I think having --client-compress and --server-compress separately but
having --compression-level *not* being separate would be confusing and
I *think* that's what the current patch proposes?

Yep, your understanding is right. The last version of the patch
posted does exactly that.
--
Michael

#20Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#17)
Re: Refactoring of compression options in pg_basebackup

On Fri, Jan 14, 2022 at 9:54 PM Michael Paquier <michael@paquier.xyz> wrote:

Okay. So, based on this feedback, I guess that something like the
attached would be what we are looking for. I have maximized the
amount of code removed with the removal of -z/-Z, but I won't fight
hard if the consensus is to keep them, either. We could also keep
-z/--gzip, and stick -Z to the new --compression-level with
--compress removed.

I mean, I really don't understand the benefit of removing -z and -Z.
-z can remain a synonym for --client-compress=gzip and -Z for
--client-compress=gzip --compression-level=$N and nobody will be
harmed. Taking them out reduces backward compatibility for no gain
that I can see.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21Robert Haas
robertmhaas@gmail.com
In reply to: Magnus Hagander (#18)
#22Magnus Hagander
magnus@hagander.net
In reply to: Robert Haas (#21)
#23Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#19)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Magnus Hagander (#22)
#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#24)
#26David G. Johnston
david.g.johnston@gmail.com
In reply to: Robert Haas (#24)
#27David G. Johnston
david.g.johnston@gmail.com
In reply to: Alvaro Herrera (#25)
#28Robert Haas
robertmhaas@gmail.com
In reply to: David G. Johnston (#27)
#29Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#28)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#30)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#31)
#33Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#31)
#34Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#33)
#35Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#32)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#35)
#37Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#36)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#37)
#39Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#38)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#39)
#41Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#41)