pg_basebackup: Allow use of arbitrary compression program

Started by Michael Harris almost 9 years ago, 5 messages
#1 Michael Harris
harmic@gmail.com

Hello,

Back in pg 9.2, we hacked a copy of pg_basebackup to add a command
line option which would allow the user to specify an arbitrary
external program (potentially including arguments) to be used to
compress the tar backup.

Our motivation was to be able to use pigz (parallel gzip
implementation) to speed up the compression. It also allows using
tools like bzip2, xz, etc instead of the inbuilt zlib.

I never ended up submitting that upstream, but now it looks like I
will have to repeat the exercise for 9.6, so I was wondering if such a
feature would be welcomed.

I found one or two references to people asking for this, eg:
https://www.commandprompt.com/blog/a_pg_basebackup_wish_list/

To do it properly would require:

1) Adding a command line option as follows:

    -C, --compressprog=PROG
        Use supplied program for compression

2) The current logic either uses zlib if compiled in, or offers no
compression at all, controlled by a series of #ifdef/#endif. I would
prefer that the user can either use zlib or an external program
without having to recompile, so I would remove the #ifdefs and replace
them with run time branching.

3) When opening the output file, if the -C option was used, use popen
to open a child process and write to that.
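
Step 3 can be illustrated outside of pg_basebackup itself. The sketch below is in Python rather than C, and the names `open_compressed_output` and `finish` are hypothetical; it only shows the general pattern of starting the compressor, writing the tar stream to its stdin, and checking its exit status. Unlike C's popen(), it does not run the command through a shell.

```python
import shlex
import subprocess


def open_compressed_output(command: str, path: str) -> subprocess.Popen:
    """Start `command` with its stdout redirected to the file at `path`.

    The returned process has a writable `stdin` pipe, which plays the role
    of the FILE* that popen(command, "w") would return in C.
    """
    out = open(path, "wb")
    try:
        proc = subprocess.Popen(
            shlex.split(command), stdin=subprocess.PIPE, stdout=out
        )
    finally:
        out.close()  # the child keeps its own copy of the descriptor
    return proc


def finish(proc: subprocess.Popen) -> None:
    """Close the pipe and raise if the compressor exited with an error."""
    proc.stdin.close()
    status = proc.wait()
    if status != 0:
        raise RuntimeError(f"compression program exited with status {status}")
```

A caller would write each chunk of tar data to `proc.stdin` and call `finish(proc)` at the end, so that a failing compressor is reported rather than silently producing a truncated backup.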

My questions are:
- Has anything like this already been discussed?
- Would this be a welcome contribution?
- Can anyone see any problems with the above approach?

Thanks!

Regards
Mike Harris

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Jeff Janes
jeff.janes@gmail.com
In reply to: Michael Harris (#1)
Re: pg_basebackup: Allow use of arbitrary compression program

On Thu, Apr 6, 2017 at 7:04 PM, Michael Harris <harmic@gmail.com> wrote:

> Hello,
>
> Back in pg 9.2, we hacked a copy of pg_basebackup to add a command
> line option which would allow the user to specify an arbitrary
> external program (potentially including arguments) to be used to
> compress the tar backup.
>
> Our motivation was to be able to use pigz (parallel gzip
> implementation) to speed up the compression. It also allows using
> tools like bzip2, xz, etc instead of the inbuilt zlib.
>
> I never ended up submitting that upstream, but now it looks like I
> will have to repeat the exercise for 9.6, so I was wondering if such a
> feature would be welcomed.

I would welcome it. I would really like to be able to use parallel pigz
and pxz.

You can stream the data into a compression tool of your choice as long as
you use tar mode and specify '-D -', but that is incompatible with
tablespaces and with xlog streaming, so it is not a very good solution.

Cheers,

Jeff

#3 Magnus Hagander
magnus@hagander.net
In reply to: Michael Harris (#1)
Re: pg_basebackup: Allow use of arbitrary compression program

On Fri, Apr 7, 2017 at 4:04 AM, Michael Harris <harmic@gmail.com> wrote:

> Hello,
>
> Back in pg 9.2, we hacked a copy of pg_basebackup to add a command
> line option which would allow the user to specify an arbitrary
> external program (potentially including arguments) to be used to
> compress the tar backup.
>
> Our motivation was to be able to use pigz (parallel gzip
> implementation) to speed up the compression. It also allows using
> tools like bzip2, xz, etc instead of the inbuilt zlib.
>
> I never ended up submitting that upstream, but now it looks like I
> will have to repeat the exercise for 9.6, so I was wondering if such a
> feature would be welcomed.
>
> I found one or two references to people asking for this, eg:
> https://www.commandprompt.com/blog/a_pg_basebackup_wish_list/
>
> To do it properly would require:
>
> 1) Adding command line option as follows:
>
> -C, --compressprog=PROG
> Use supplied program for compression
>
> 2) The current logic either uses zlib if compiled in, or offers no
> compression at all, controlled by a series of #ifdef/#endif. I would
> prefer that the user can either use zlib or an external program
> without having to recompile, so I would remove the #ifdefs and replace
> them with run time branching.

Not sure how that would work or be needed. The reasonable thing would be
that if zlib is available when building, the choices are "no compression",
"zlib compression" or "external compression". If zlib is not available
when building, the choices are "no compression" or "external
compression".

Or maybe I'm misunderstanding what you're saying?

> 3) When opening the output file, if the -C option was used, use popen
> to open a child process and write to that.
>
> My questions are:
> - Has anything like this already been discussed?

I think it has, but not in detail.

> - Would this be a welcome contribution?

Yes, I definitely think this would be useful.

> - Can anyone see any problems with the above approach?

One thing to consider is the work done recently to ensure that the output
is properly synchronized when written to disk. I don't think it's
reasonable to expect that from an external compression program, but if it
can be made optional that'd be good. Or at least be careful not to break
the current behavior.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#4 Michael Harris
harmic@gmail.com
In reply to: Magnus Hagander (#3)
Re: pg_basebackup: Allow use of arbitrary compression program

Hi,

Thanks for the feedback!

>> 2) The current logic either uses zlib if compiled in, or offers no
>> compression at all, controlled by a series of #ifdef/#endif. I would
>> prefer that the user can either use zlib or an external program
>> without having to recompile, so I would remove the #ifdefs and replace
>> them with run time branching.
>
> Not sure how that would work or be needed. The reasonable thing would be if zlib
> is available when building the choices would be "no compression",
> "zlib compression" or "external compression". If there was no zlib available
> when building, the choices would be "no compression" or "external compression".

That's exactly how I intend it to work. I had thought that the current
structure of the code would not allow that, but looking at it more
closely I see that it does, so I don't have to re-organize the
#ifdefs.
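
The agreed behaviour can be written down as a small sketch. It is in Python rather than C, so the `have_zlib` flag stands in for whatever the build detects at compile time, and the function names, the method labels and the rule that an external program takes precedence over a compression level are all assumptions made for illustration, not part of the proposal.

```python
from typing import List, Optional


def compression_choices(have_zlib: bool) -> List[str]:
    """List the methods a build would offer under the scheme described."""
    choices = ["none"]
    if have_zlib:
        choices.append("zlib")  # only offered when zlib was found at build time
    choices.append("external")  # never depends on zlib
    return choices


def pick_method(have_zlib: bool, compress_level: int,
                compress_prog: Optional[str]) -> str:
    """Choose a method at run time from hypothetical option values."""
    if compress_prog:
        return "external"  # assumption: an external program takes precedence
    if compress_level > 0:
        if not have_zlib:
            raise ValueError("this build does not support zlib compression")
        return "zlib"
    return "none"
```

With this shape, only the "zlib" branch needs to stay behind a build-time guard; the other two paths are always available.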

Regards // Mike


#5 Michael Harris
harmic@gmail.com
In reply to: Michael Harris (#4)
Re: pg_basebackup: Allow use of arbitrary compression program

Hi All,

I have a working prototype now, but there is one aspect I haven't been
able to find the best solution for.

So far, the command line interface has one new option:

    -C, --compressprog=PRG   use supplied external program for compression

An example usage would be:

    pg_basebackup -D /home/harmic/tmp/ -C bzip2 -F t

The command string supplied to -C should be a compression command that
reads from stdin and outputs to stdout.

The problem is: when constructing the output filename(s), how can we
give them the correct suffix (.gz, .bz2, .xz, ...)?

The options I can think of are:

1. Add yet another command line option to specify a suffix
2. Some kind of heuristic to figure it out from the supplied command
string (from known compression programs, but that will never be
complete)
3. Don't worry about it, let the user rename them afterwards, in
which case they would be named xxxx.tar
4. Make the compression command a template, eg. "bzip2 -c > %s.bz2",
so that the template itself will add the suffix
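
To make option 2 concrete, here is a minimal sketch of such a heuristic. It is in Python for brevity, and the table contents and the name `guess_suffix` are hypothetical. It looks only at the program name, which is exactly why it can never be complete: unknown tools fall back to no suffix.

```python
import os
import shlex

# Hypothetical table of known compressors and the suffix they conventionally use.
KNOWN_SUFFIXES = {
    "gzip": ".gz",
    "pigz": ".gz",
    "bzip2": ".bz2",
    "xz": ".xz",
    "pxz": ".xz",
}


def guess_suffix(command: str) -> str:
    """Guess a file suffix from the first word of a compression command."""
    argv = shlex.split(command)
    if not argv:
        return ""
    program = os.path.basename(argv[0])  # ignore any leading directory
    return KNOWN_SUFFIXES.get(program, "")  # unknown program: no suffix
```

For example, `guess_suffix("pigz -p 8")` gives ".gz", while an unrecognised command gives an empty string and leaves the file named xxxx.tar, as in option 3.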

Option 4 might also be more flexible for tools that don't support output
to stdout, but it is a bit more complex to use.
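
A rough sketch of how the template idea in option 4 could behave, again in Python and with hypothetical names (`expand_template`, `run_template`). It assumes that %s is replaced by the shell-quoted base file name and that the result is run through a shell, which the redirection in the example template requires.

```python
import shlex
import subprocess


def expand_template(template: str, base_path: str) -> str:
    """Replace %s in the template with the shell-quoted output base name."""
    if "%s" not in template:
        raise ValueError("template must contain %s for the output file name")
    return template.replace("%s", shlex.quote(base_path))


def run_template(template: str, base_path: str, data: bytes) -> None:
    """Feed `data` to the expanded command on stdin, via the shell."""
    command = expand_template(template, base_path)
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(command, shell=True, input=data, check=True)
```

With the template "bzip2 -c > %s.bz2" and a base name of /backup/base.tar, the expanded command is "bzip2 -c > /backup/base.tar.bz2", so the suffix comes from the user rather than from pg_basebackup.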

Any other ideas?

Regards // Mike

On Wed, Apr 12, 2017 at 3:49 PM, Michael Harris <harmic@gmail.com> wrote:

> Hi,
>
> Thanks for the feedback!
>
>>> 2) The current logic either uses zlib if compiled in, or offers no
>>> compression at all, controlled by a series of #ifdef/#endif. I would
>>> prefer that the user can either use zlib or an external program
>>> without having to recompile, so I would remove the #ifdefs and replace
>>> them with run time branching.
>>
>> Not sure how that would work or be needed. The reasonable thing would be if zlib
>> is available when building the choices would be "no compression",
>> "zlib compression" or "external compression". If there was no zlib available
>> when building, the choices would be "no compression" or "external compression".
>
> That's exactly how I intend it to work. I had thought that the current
> structure of the code would not allow that, but looking at it more
> closely I see that it does, so I don't have to re-organize the
> #ifdefs.
>
> Regards // Mike
