pg_upgrade --copy-file-range

Started by Thomas Munroabout 3 years ago46 messageshackers

thomas.munro@gmail.com

about 3 years ago

Hello,

I was just in a pg_upgrade unconference session at PGCon where the
lack of $SUBJECT came up. This system call gives the kernel the
option to use fast block cloning on XFS, ZFS (as of very recently),
etc, and works on Linux and FreeBSD. It's probably much the same as
--clone mode on COW file systems, except that is Linux-only. On
overwrite file systems (ie not copy-on-write, like ext4), it may also
be able to push copies down to storage hardware/network file systems.

There was something like this in the nearby large files patch set, but
in that version it just magically did it when available in --copy
mode. Now I think the user should have to have to opt in with
--copy-file-range, and simply to error out if it fails. It may not
work in some cases -- for example, the man page says that older Linux
systems can fail with EXDEV when you try to copy across file systems,
while newer systems will do something less efficient but still
sensible internally; also I saw a claim that some older versions had
weird bugs. Better to just expose the raw functionality and let users
say when they want it and read the error if it fail, I think.

Peter Eisentraut

peter_e@gmx.net

almost 3 years ago

In reply to: Thomas Munro (#1)

Re: pg_upgrade --copy-file-range

On 02.06.23 21:30, Thomas Munro wrote:

I was just in a pg_upgrade unconference session at PGCon where the
lack of $SUBJECT came up. This system call gives the kernel the
option to use fast block cloning on XFS, ZFS (as of very recently),
etc, and works on Linux and FreeBSD. It's probably much the same as
--clone mode on COW file systems, except that is Linux-only. On
overwrite file systems (ie not copy-on-write, like ext4), it may also
be able to push copies down to storage hardware/network file systems.

There was something like this in the nearby large files patch set, but
in that version it just magically did it when available in --copy
mode. Now I think the user should have to have to opt in with
--copy-file-range, and simply to error out if it fails. It may not
work in some cases -- for example, the man page says that older Linux
systems can fail with EXDEV when you try to copy across file systems,
while newer systems will do something less efficient but still
sensible internally; also I saw a claim that some older versions had
weird bugs. Better to just expose the raw functionality and let users
say when they want it and read the error if it fail, I think.

When we added --clone, copy_file_range() was available, but the problem
was that it was hard for the user to predict whether you'd get the fast
clone behavior or the slow copy behavior. That's the kind of thing you
want to know when planning and testing your upgrade. At the time, there
were patches passed around in Linux kernel circles that would have been
able to enforce cloning via the flags argument of copy_file_range(), but
that didn't make it to the mainline.

So, yes, being able to specify exactly which copy mechanism to use makes
sense, so that users can choose the tradeoffs.

About your patch:

I think you should have a "check" function called from
check_new_cluster(). That check function can then also handle the "not
supported" case, and you don't need to handle that in
parseCommandLine(). I suggest following the clone example for these,
since the issues there are very similar.

Thomas Munro

thomas.munro@gmail.com

over 2 years ago

In reply to: Peter Eisentraut (#2)

Re: pg_upgrade --copy-file-range

On Mon, Jul 3, 2023 at 7:47 PM Peter Eisentraut <peter@eisentraut.org> wrote:

When we added --clone, copy_file_range() was available, but the problem
was that it was hard for the user to predict whether you'd get the fast
clone behavior or the slow copy behavior. That's the kind of thing you
want to know when planning and testing your upgrade. At the time, there
were patches passed around in Linux kernel circles that would have been
able to enforce cloning via the flags argument of copy_file_range(), but
that didn't make it to the mainline.

So, yes, being able to specify exactly which copy mechanism to use makes
sense, so that users can choose the tradeoffs.

Thanks for looking. Yeah, it is quite inconvenient for planning
purposes that it is hard for a user to know which internal strategy it
uses, but that's the interface we have (and clearly "flags" is
reserved for future usage so that might still evolve..).

About your patch:

I think you should have a "check" function called from
check_new_cluster(). That check function can then also handle the "not
supported" case, and you don't need to handle that in
parseCommandLine(). I suggest following the clone example for these,
since the issues there are very similar.

Done.

Peter Eisentraut

peter_e@gmx.net

over 2 years ago

In reply to: Thomas Munro (#3)

Re: pg_upgrade --copy-file-range

On 08.10.23 07:15, Thomas Munro wrote:

About your patch:

I think you should have a "check" function called from
check_new_cluster(). That check function can then also handle the "not
supported" case, and you don't need to handle that in
parseCommandLine(). I suggest following the clone example for these,
since the issues there are very similar.

Done.

This version looks good to me.

Tiny nit: You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a
different suffix.

Peter Eisentraut

peter_e@gmx.net

over 2 years ago

In reply to: Peter Eisentraut (#4)

Re: pg_upgrade --copy-file-range

On 13.11.23 08:15, Peter Eisentraut wrote:

On 08.10.23 07:15, Thomas Munro wrote:

About your patch:

I think you should have a "check" function called from
check_new_cluster(). That check function can then also handle the "not
supported" case, and you don't need to handle that in
parseCommandLine(). I suggest following the clone example for these,
since the issues there are very similar.

Done.

This version looks good to me.

Tiny nit: You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a
different suffix.

Thomas, are you planning to proceed with this patch?

Thomas Munro

thomas.munro@gmail.com

over 2 years ago

In reply to: Peter Eisentraut (#5)

Re: pg_upgrade --copy-file-range

On Sat, Dec 23, 2023 at 9:40 AM Peter Eisentraut <peter@eisentraut.org> wrote:

On 13.11.23 08:15, Peter Eisentraut wrote:

On 08.10.23 07:15, Thomas Munro wrote:

About your patch:

I think you should have a "check" function called from
check_new_cluster(). That check function can then also handle the "not
supported" case, and you don't need to handle that in
parseCommandLine(). I suggest following the clone example for these,
since the issues there are very similar.

Done.

This version looks good to me.

Tiny nit: You copy-and-pasted "%s/PG_VERSION.clonetest"; maybe choose a
different suffix.

Thomas, are you planning to proceed with this patch?

Yes. Sorry for being slow... got stuck working on an imminent new
version of streaming read. I will be defrosting my commit bit and
committing this one and a few things shortly.

As it happens I was just thinking about this particular patch because
I suddenly had a strong urge to teach pg_combinebackup to use
copy_file_range. I wonder if you had the same idea...

Michael Paquier

michael@paquier.xyz

over 2 years ago

In reply to: Thomas Munro (#6)

Re: pg_upgrade --copy-file-range

On Sat, Dec 23, 2023 at 09:52:59AM +1300, Thomas Munro wrote:

As it happens I was just thinking about this particular patch because
I suddenly had a strong urge to teach pg_combinebackup to use
copy_file_range. I wonder if you had the same idea...

Yeah, +1. That would make copy_file_blocks() more efficient where the
code is copying 50 blocks in batches because it needs to reassign
checksums to the blocks copied.
--
Michael

Jakub Wartak

jakub.wartak@enterprisedb.com

over 2 years ago

In reply to: Michael Paquier (#7)

Re: pg_upgrade --copy-file-range

Hi Thomas, Michael, Peter and -hackers,

On Sun, Dec 24, 2023 at 3:57 AM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Dec 23, 2023 at 09:52:59AM +1300, Thomas Munro wrote:

As it happens I was just thinking about this particular patch because
I suddenly had a strong urge to teach pg_combinebackup to use
copy_file_range. I wonder if you had the same idea...

Yeah, +1. That would make copy_file_blocks() more efficient where the
code is copying 50 blocks in batches because it needs to reassign
checksums to the blocks copied.

I've tried to achieve what you were discussing. Actually this was my
first thought when using pg_combinebackup with larger (realistic)
backup sizes back in December. Attached is a set of very DIRTY (!)
patches that provide CoW options (--clone/--copy-range-file) to
pg_combinebackup (just like pg_upgrade to keep it in sync), while also
refactoring some related bits of code to avoid duplication.

With XFS (with reflink=1 which is default) on Linux with kernel 5.10
and ~210GB backups, I'm getting:

root@jw-test-1:/xfs# du -sm *
210229 full
250 incr.1

Today in master, the old classic read()/while() loop without
CoW/reflink optimization :
root@jw-test-1:/xfs# rm -rf outtest; sync; sync ; sync; echo 3 | sudo
tee /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup
--manifest-checksums=NONE -o outtest full incr.1
3

real 49m43.963s
user 0m0.887s
sys 2m52.697s

VS patch with "--clone" :

root@jw-test-1:/xfs# rm -rf outtest; sync; sync ; sync; echo 3 | sudo
tee /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup
--manifest-checksums=NONE --clone -o outtest full incr.1
3

real 0m39.812s
user 0m0.325s
sys 0m2.401s

So it is 49mins down to 40 seconds(!) +/-10s (3 tries) if the FS
supports CoW/reflinks (XFS, BTRFS, upcoming bcachefs?). It looks to me
that this might mean that if one actually wants to use incremental
backups (to get minimal RTO), it would be wise to only use CoW
filesystems from the start so that RTO is as low as possible.

Random patch notes:
- main meat is in v3-0002*, I hope i did not screw something seriously
- in worst case: it is opt-in through switch, so the user always can
stick to the classic copy
- no docs so far
- pg_copyfile_offload_supported() should actually be fixed if it is a
good path forward
- pgindent actually indents larger areas of code that I would like to,
any ideas or is it ok?
- not tested on Win32/MacOS/FreeBSD
- i've tested pg_upgrade manually and it seems to work and issue
correct syscalls, however some tests are failing(?). I haven't
investigated why yet due to lack of time.

Any help is appreciated.

-J.

Peter Eisentraut

peter_e@gmx.net

over 2 years ago

In reply to: Jakub Wartak (#8)

Re: pg_upgrade --copy-file-range

On 05.01.24 13:40, Jakub Wartak wrote:

Random patch notes:
- main meat is in v3-0002*, I hope i did not screw something seriously
- in worst case: it is opt-in through switch, so the user always can
stick to the classic copy
- no docs so far
- pg_copyfile_offload_supported() should actually be fixed if it is a
good path forward
- pgindent actually indents larger areas of code that I would like to,
any ideas or is it ok?
- not tested on Win32/MacOS/FreeBSD
- i've tested pg_upgrade manually and it seems to work and issue
correct syscalls, however some tests are failing(?). I haven't
investigated why yet due to lack of time.

Something is wrong with the pgindent in your patch set. Maybe you used
a wrong version. You should try to fix that, because it is hard to
process your patch with that amount of unrelated reformatting.

As far as I can tell, the original pg_upgrade patch has been ready to
commit since October. Unless Thomas has any qualms that have not been
made explicit in this thread, I suggest we move ahead with that.

And then Jakub could rebase his patch set on top of that. It looks like
if the formatting issues are fixed, the remaining pg_combinebackup
support isn't that big.

#10

Thomas Munro

thomas.munro@gmail.com

over 2 years ago

In reply to: Peter Eisentraut (#9)

Re: pg_upgrade --copy-file-range

On Wed, Mar 6, 2024 at 2:43 AM Peter Eisentraut <peter@eisentraut.org> wrote:

As far as I can tell, the original pg_upgrade patch has been ready to
commit since October. Unless Thomas has any qualms that have not been
made explicit in this thread, I suggest we move ahead with that.

pg_upgrade --copy-file-range pushed. The only change I made was to
remove the EINTR retry condition which was subtly wrong and actually
not needed here AFAICS. (Erm, maybe I did have an unexpressed qualm
about some bug reports unfolding around that time about corruption
linked to copy_file_range that might have spooked me but those seem to
have been addressed.)

And then Jakub could rebase his patch set on top of that. It looks like
if the formatting issues are fixed, the remaining pg_combinebackup
support isn't that big.

I'll also go and rebase CREATE DATABASE ... STRATEGY=file_clone[1]/messages/by-id/CA+hUKGLM+t+SwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w@mail.gmail.com.

[1]: /messages/by-id/CA+hUKGLM+t+SwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w@mail.gmail.com

#11

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 2 years ago

In reply to: Thomas Munro (#10)

Re: pg_upgrade --copy-file-range

Hi,

I took a quick look at the remaining part adding copy_file_range to
pg_combinebackup. The patch no longer applies, so I had to rebase it.
Most of the issues were trivial, but I had to fix a couple missing
prototypes - I added them to copy_file.h/c, mostly.

0001 is the minimal rebase + those fixes

0002 has a couple review comments in copy_file, and it also undoes a lot
of unnecessary formatting changes (already pointed out by Peter a couple
days ago).

A couple review comments:

1) AFAIK opt_errinfo() returns pointer to the local "buf" variable.

2) I wonder if we even need opt_errinfo(). I'm not sure it actually
makes anything simpler.

3) I think it'd be nice to make CopyFileMethod more consistent with
transferMode in pg_upgrade.h (I mean, it seems wise to make the naming
more consistent, it's probably not worth unifying this somehow).

4) I wonder how we came up with copying the files by 50 blocks, but I
now realize it's been like this before this patch. I only noticed
because the patch adds a comment before buffer_size calculation.

5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new
name is way more generic / less descriptive - it's clear it copies the
file block by block (well, in chunks). pg_copyfile is pretty vague.

6) This leaves behind copy_file_copyfile, which is now unused.

7) The patch reworks how combinebackup deals with alternative copy
implementations - instead of setting strategy_implementation and calling
that, the decisions now happen in pg_copyfile_offload with a lot of
conditions / ifdef / defined ... I find it pretty hard to understand and
reason about. I liked the strategy_implementation approach, as it forces
us to keep each method in a separate function.

Perhaps there's a reason why that doesn't work for copy_file_range? But
in that case this needs much clearer comments.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12

Jakub Wartak

jakub.wartak@enterprisedb.com

over 2 years ago

In reply to: Tomas Vondra (#11)

Re: pg_upgrade --copy-file-range

Hi Tomas,

I took a quick look at the remaining part adding copy_file_range to
pg_combinebackup. The patch no longer applies, so I had to rebase it.
Most of the issues were trivial, but I had to fix a couple missing
prototypes - I added them to copy_file.h/c, mostly.

0001 is the minimal rebase + those fixes

0002 has a couple review comments in copy_file, and it also undoes a lot
of unnecessary formatting changes (already pointed out by Peter a couple
days ago).

Thank you very much for this! As discussed privately, I'm not in
position right now to pursue this further at this late stage (at least
for v17, which would require an aggressive schedule ). My plan was
more for v18 after Peter's email, due to other obligations. But if you
have cycles and want to continue, please do so without hesitation -
I'll try to chime in a long way to test and review for sure.

A couple review comments:

1) AFAIK opt_errinfo() returns pointer to the local "buf" variable.

2) I wonder if we even need opt_errinfo(). I'm not sure it actually
makes anything simpler.

Yes, as it stands it's broken (somewhat I've missed gcc warning),
should be pg_malloc(). I hardly remember, but I wanted to avoid code
duplication. No strong opinion, maybe that's a different style, I'll
adapt as necessary.

3) I think it'd be nice to make CopyFileMethod more consistent with
transferMode in pg_upgrade.h (I mean, it seems wise to make the naming
more consistent, it's probably not worth unifying this somehow).

4) I wonder how we came up with copying the files by 50 blocks, but I
now realize it's been like this before this patch. I only noticed
because the patch adds a comment before buffer_size calculation.

It looks like it was like that before pg_upgrade even was moved into
the core. 400kB is indeed bit strange value, so we can leave it as it
is or make the COPY_BUF_SIZ 128kb - see [1]https://eklitzke.org/efficient-file-copying-on-linux (i've double checked cp(1)
uses still 128kB today), or maybe just stick to something like 256 or
512 kBs.

5) I dislike the renaming of copy_file_blocks to pg_copyfile. The new
name is way more generic / less descriptive - it's clear it copies the
file block by block (well, in chunks). pg_copyfile is pretty vague.

6) This leaves behind copy_file_copyfile, which is now unused.

7) The patch reworks how combinebackup deals with alternative copy
implementations - instead of setting strategy_implementation and calling
that, the decisions now happen in pg_copyfile_offload with a lot of
conditions / ifdef / defined ... I find it pretty hard to understand and
reason about. I liked the strategy_implementation approach, as it forces
us to keep each method in a separate function.

Well some context (maybe it was my mistake to continue in this
./thread rather starting a new one): my plan was 3-in-1: in the
original proposal (from Jan) to provide CoW as generic facility for
other to use - in src/common/file_utils.c as per
v3-0002-Confine-various-OS-copy-on-write-and-other-copy-a.patch - to
unify & confine CoW methods and their quirkiness between
pg_combinebackup and pg_upgrade and other potential CoW uses too. That
was before Thomas M. pushed CoW just for pg_upgrade as
d93627bcbe5001750e7611f0e637200e2d81dcff. I had this idea back then to
have pg_copyfile() [normal blocks copy] and
pg_copyfile_offload_supported(),
pg_copyfile_offload(PG_COPYFILE_IOCTL_FICLONE ,
PG_COPYFILE_COPY_FILE_RANGE,
PG_COPYFILE_who_has_idea_what_they_come_up_with_in_future). In Your's
version of the patch it's local to pg_combinebackup, so it might make
no sense after all. If you look at the pg_upgrade and pg_combinebackup
they both have code duplication with lots of those ifs/IFs (assuming
user wants to have it as drop-in [--clone/--copy/--copyfile] and
platform may / may not have it). I've even considered
--cow=ficlone|copy_file_range to sync both tools from CLI arguments
point of view, but that would break backwards compatibility, so I did
not do that.

Also there's a problem with pg_combinebackup's strategy_implementation
that it actually cannot on its own decide (I think) which CoW to use
or not. There were some longer discussions that settled on one thing
(for pg_upgrade): it's the user who is in control HOW the copy gets
done (due to potential issues in OS CoW() implementations where e.g.
if NFS would be involved on one side). See pg_upgrade
--clone/--copy/--copy-file-range/--sync-method options. I wanted to
stick to that, so pg_combinebackup also needs to give the same options
to the user.

That's was for the historical context, now you wrote "it's probably
not worth unifying this somehow" few sentences earlier, so my take is
the following: we can just concentrate on getting the
copy_file_range() and ioctl_ficlone to pg_combinebackup at the price
of duplicating some complexity for now (in short to start with clear
plate , it doesn't necessary needs to be my patch as base if we think
it's worthwhile for v17 - or stick to your reworked patch of mine).

Later (v18?) some bigger than this refactor could unify and move the
copy methods to some more central place (so then we would have sync as
there would be no doubling like you mentioned e.g.: pg_upgrade's enum
transferMode <-> patch enum CopyFileMethod.

So for now I'm +1 to renaming all the things as you want -- indeed
pg_copy* might not be a good fit in a localized version.

-J.

[1]: https://eklitzke.org/efficient-file-copying-on-linux

#13

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 2 years ago

In reply to: Jakub Wartak (#12)

Re: pg_upgrade --copy-file-range

Here's a patch reworked along the lines from a couple days ago.

The primary goals were to add clone/copy_file_range while minimizing
unnecessary disruption, and overall cleanup of the patch. I'm not saying
it's committable, but I think the patch is much easier to understand.

The main change is that this abandons the idea of handling all possible
cases in a single function that looks like a maze of ifdefs, and instead
separates each case into it's own function and the decision happens much
earlier. This is pretty much exactly what pg_upgrade does, BTW.

There's maybe an argument that these functions could be unified and
moved to a library in src/common - I can imagine doing that, but I don't
think it's required. The functions are pretty trivial wrappers, and it's
not like we expect many more callers. And there's probably stuff we'd
need to keep out of that library (e.g. the decision which copy/clone
methods are available / should be used or error reporting). So it
doesn't seem worth it, at least for now.

There's one question, though. As it stands, there's a bit of asymmetry
between handling CopyFile() on WIN32 and the clone/copy_file_range on
other platforms). On WIN32, we simply automatically switch to CopyFile
automatically, if we don't need to calculate checksum. But with the
other methods, error out if the user requests those and we need to
calculate the checksum.

The asymmetry comes from the fact there's no option to request CopyFile
on WIN32, and we feel confident it's always the right thing to do (safe,
faster). We don't seem to know that for the other methods, so the user
has to explicitly request those. And if the user requests --clone, for
example, it'd be wrong to silently fallback to plain copy.

Still, I wonder if this might cause some undesirable issues during
restores. But I guess that's why we have --dry-run.

This asymmetry also shows a bit in the code - the CopyFile is coded and
called a bit differently from the other methods. FWIW I abandoned the
approach with "strategy" and just use a switch on CopyMode enum, just
like pg_upgrade does.

There's a couple more smaller changes:

- Addition of docs for --clone/--copy-file-range (shameless copy from
pg_upgrade docs).

- Removal of opt_errinfo - not only was it buggy, I think the code is
actually cleaner without it.

- Removal of EINTR retry condition from copy_file_range handling (this
is what Thomas ended up for pg_upgrade while committing that part).

Put together, this cuts the patch from ~40kB to ~15kB (most of this is
due to the cleanup of unnecessary whitespace changes, though).

I think to make this committable, this requires some review and testing,
ideally on a range of platforms.

One open question is how to allow testing this. For pg_upgrade we now
have PG_TEST_PG_UPGRADE_MODE, which can be set to e.g. "--clone". I
wonder if we should add PG_TEST_PG_COMBINEBACKUP_MODE ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14

Robert Haas

robertmhaas@gmail.com

over 2 years ago

In reply to: Tomas Vondra (#13)

Re: pg_upgrade --copy-file-range

On Fri, Mar 22, 2024 at 10:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

There's one question, though. As it stands, there's a bit of asymmetry
between handling CopyFile() on WIN32 and the clone/copy_file_range on
other platforms). On WIN32, we simply automatically switch to CopyFile
automatically, if we don't need to calculate checksum. But with the
other methods, error out if the user requests those and we need to
calculate the checksum.

That seems completely broken. copy_file() needs to have the ability to
calculate a checksum if one is required; when one isn't required, it
can do whatever it likes. So we should always fall back to the
block-by-block method if we need a checksum. Whatever option the user
specified should only be applied when we don't need a checksum.

Consider, for example:

pg_basebackup -D sunday -c fast --manifest-checksums=CRC32C
pg_basebackup -D monday -c fast --manifest-checksums=SHA224
--incremental sunday/backup_manifest
pg_combinebackup sunday monday -o tuesday --manifest-checksums=CRC32C --clone

Any files that are copied in their entirety from Sunday's backup can
be cloned, if we have support for cloning. But any files copied from
Monday's backup will need to be re-checksummed, since the checksum
algorithms don't match. With what you're describing, it sounds like
pg_combinebackup would just fail in this case; I don't think that's
the behavior that the user is going to want.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 2 years ago

In reply to: Robert Haas (#14)

Re: pg_upgrade --copy-file-range

On 3/22/24 17:42, Robert Haas wrote:

On Fri, Mar 22, 2024 at 10:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

There's one question, though. As it stands, there's a bit of asymmetry
between handling CopyFile() on WIN32 and the clone/copy_file_range on
other platforms). On WIN32, we simply automatically switch to CopyFile
automatically, if we don't need to calculate checksum. But with the
other methods, error out if the user requests those and we need to
calculate the checksum.

That seems completely broken. copy_file() needs to have the ability to
calculate a checksum if one is required; when one isn't required, it
can do whatever it likes. So we should always fall back to the
block-by-block method if we need a checksum. Whatever option the user
specified should only be applied when we don't need a checksum.

Consider, for example:

pg_basebackup -D sunday -c fast --manifest-checksums=CRC32C
pg_basebackup -D monday -c fast --manifest-checksums=SHA224
--incremental sunday/backup_manifest
pg_combinebackup sunday monday -o tuesday --manifest-checksums=CRC32C --clone

Any files that are copied in their entirety from Sunday's backup can
be cloned, if we have support for cloning. But any files copied from
Monday's backup will need to be re-checksummed, since the checksum
algorithms don't match. With what you're describing, it sounds like
pg_combinebackup would just fail in this case; I don't think that's
the behavior that the user is going to want.

Right, this will happen:

pg_combinebackup: error: unable to use accelerated copy when manifest
checksums are to be calculated. Use --no-manifest

Are you saying we should just silently override the copy method and do
the copy block by block? I'm not strongly opposed to that, but it feels
wrong to just ignore that the user explicitly requested cloning, and I'm
not sure why should this be different from any other case when the user
requests incompatible combination of options and/or options that are not
supported on the current configuration.

Why not just to tell the user to use the correct parameters, i.e. either
remove --clone or add --no-manifest?

FWIW I now realize it actually fails a bit earlier than I thought - when
parsing the options, not in copy_file. But then some checks (if a given
copy method is supported) happen in the copy functions. I wonder if it'd
be better/possible to do all of that in one place, not sure.

Also, the message only suggests to use --no-manifest. It probably should
suggest removing --clone too.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16

Robert Haas

robertmhaas@gmail.com

over 2 years ago

In reply to: Tomas Vondra (#15)

Re: pg_upgrade --copy-file-range

On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Right, this will happen:

pg_combinebackup: error: unable to use accelerated copy when manifest
checksums are to be calculated. Use --no-manifest

Are you saying we should just silently override the copy method and do
the copy block by block?

Yes.

I'm not strongly opposed to that, but it feels
wrong to just ignore that the user explicitly requested cloning, and I'm
not sure why should this be different from any other case when the user
requests incompatible combination of options and/or options that are not
supported on the current configuration.

I don't feel like copying block-by-block when that's needed to compute
a checksum is ignoring what the user requested. I mean, if we'd had to
perform reconstruction rather than copying an entire file, we would
have done that regardless of whether --clone had been there, and I
don't see why the need-to-compute-a-checksum case is any different. I
think we should view a flag like --clone as specifying how to copy a
file when we don't need to do anything but copy it. I don't think it
should dictate that we're not allowed to perform other processing when
that other processing is required.

From my point of view, this is not a case of incompatible options
having been specified. If you specify run pg_basebackup with both
--format=p and --format=t, those are incompatible options; the backup
can be done one way or the other, but not both at once. But here
there's no such conflict. Block-by-block copying and fast-copying can
happen as part of the same operation, as in the example that I showed,
where some files need the block-by-block copying and some can be
fast-copied. The user is entitled to specify which fast-copying method
they would like to have used for the files where fast-copying is
possible without getting a failure just because it isn't possible for
every single file.

Or to say it the other way around, if there's 1 file that needs to be
copied block by block and 99 files that can be fast-copied, you want
to force the user to the block-by-block method for all 100 files. I
want to force the use of the block-by-block method for the 1 file
where that's the only valid method, and let them choose what they want
to do for the other 99.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17

Thomas Munro

thomas.munro@gmail.com

over 2 years ago

In reply to: Robert Haas (#16)

Re: pg_upgrade --copy-file-range

Hmm, this discussion seems to assume that we only use
copy_file_range() to copy/clone whole segment files, right? That's
great and may even get most of the available benefit given typical
databases with many segments of old data that never changes, but... I
think copy_write_range() allows us to go further than the other
whole-file clone techniques: we can stitch together parts of an old
backup segment file and an incremental backup to create a new file.
If you're interested in minimising disk use while also removing
dependencies on the preceding chain of backups, then it might make
sense to do that even if you *also* have to read the data to compute
the checksums, I think? That's why I mentioned it: if
copy_file_range() (ie sub-file-level block sharing) is a solution in
search of a problem, has the world ever seen a better problem than
pg_combinebackup?

#18

Robert Haas

robertmhaas@gmail.com

over 2 years ago

In reply to: Thomas Munro (#17)

Re: pg_upgrade --copy-file-range

On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Hmm, this discussion seems to assume that we only use
copy_file_range() to copy/clone whole segment files, right? That's
great and may even get most of the available benefit given typical
databases with many segments of old data that never changes, but... I
think copy_write_range() allows us to go further than the other
whole-file clone techniques: we can stitch together parts of an old
backup segment file and an incremental backup to create a new file.
If you're interested in minimising disk use while also removing
dependencies on the preceding chain of backups, then it might make
sense to do that even if you *also* have to read the data to compute
the checksums, I think? That's why I mentioned it: if
copy_file_range() (ie sub-file-level block sharing) is a solution in
search of a problem, has the world ever seen a better problem than
pg_combinebackup?

That makes sense; it's just a different part of the code than I
thought we were talking about.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 2 years ago

In reply to: Robert Haas (#16)

Re: pg_upgrade --copy-file-range

On 3/22/24 19:40, Robert Haas wrote:

On Fri, Mar 22, 2024 at 1:22 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Right, this will happen:

pg_combinebackup: error: unable to use accelerated copy when manifest
checksums are to be calculated. Use --no-manifest

Are you saying we should just silently override the copy method and do
the copy block by block?

Yes.

I'm not strongly opposed to that, but it feels
wrong to just ignore that the user explicitly requested cloning, and I'm
not sure why should this be different from any other case when the user
requests incompatible combination of options and/or options that are not
supported on the current configuration.

I don't feel like copying block-by-block when that's needed to compute
a checksum is ignoring what the user requested. I mean, if we'd had to
perform reconstruction rather than copying an entire file, we would
have done that regardless of whether --clone had been there, and I
don't see why the need-to-compute-a-checksum case is any different. I
think we should view a flag like --clone as specifying how to copy a
file when we don't need to do anything but copy it. I don't think it
should dictate that we're not allowed to perform other processing when
that other processing is required.

From my point of view, this is not a case of incompatible options
having been specified. If you specify run pg_basebackup with both
--format=p and --format=t, those are incompatible options; the backup
can be done one way or the other, but not both at once. But here
there's no such conflict. Block-by-block copying and fast-copying can
happen as part of the same operation, as in the example that I showed,
where some files need the block-by-block copying and some can be
fast-copied. The user is entitled to specify which fast-copying method
they would like to have used for the files where fast-copying is
possible without getting a failure just because it isn't possible for
every single file.

Or to say it the other way around, if there's 1 file that needs to be
copied block by block and 99 files that can be fast-copied, you want
to force the user to the block-by-block method for all 100 files. I
want to force the use of the block-by-block method for the 1 file
where that's the only valid method, and let them choose what they want
to do for the other 99.

OK, that makes sense. Here's a patch that should work like this - in
copy_file we check if we need to calculate checksums, and either use the
requested copy method, or fall back to the block-by-block copy.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 2 years ago

In reply to: Robert Haas (#18)

Re: pg_upgrade --copy-file-range

On 3/23/24 13:38, Robert Haas wrote:

On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Hmm, this discussion seems to assume that we only use
copy_file_range() to copy/clone whole segment files, right? That's
great and may even get most of the available benefit given typical
databases with many segments of old data that never changes, but... I
think copy_write_range() allows us to go further than the other
whole-file clone techniques: we can stitch together parts of an old
backup segment file and an incremental backup to create a new file.
If you're interested in minimising disk use while also removing
dependencies on the preceding chain of backups, then it might make
sense to do that even if you *also* have to read the data to compute
the checksums, I think? That's why I mentioned it: if
copy_file_range() (ie sub-file-level block sharing) is a solution in
search of a problem, has the world ever seen a better problem than
pg_combinebackup?

That makes sense; it's just a different part of the code than I
thought we were talking about.

Yeah, that's in write_reconstructed_file() and the patch does not touch
that at all. I agree it would be nice to use copy_file_range() in this
part too, and it doesn't seem it'd be that hard to do, I think.

It seems we'd just need a "fork" that either calls pread/pwrite or
copy_file_range, depending on checksums and what was requested.

BTW is there a reason why the code calls "write" and not "pg_pwrite"?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 2 years ago

In reply to: Tomas Vondra (#20)

#22

Robert Haas

robertmhaas@gmail.com

over 2 years ago

In reply to: Tomas Vondra (#19)

#23

Robert Haas

robertmhaas@gmail.com

over 2 years ago

In reply to: Tomas Vondra (#20)

#24

Jakub Wartak

jakub.wartak@enterprisedb.com

about 2 years ago

In reply to: Tomas Vondra (#21)

#25

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Jakub Wartak (#24)

#26

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Robert Haas (#22)

#27

Jakub Wartak

jakub.wartak@enterprisedb.com

about 2 years ago

In reply to: Tomas Vondra (#25)

#28

Robert Haas

robertmhaas@gmail.com

about 2 years ago

In reply to: Tomas Vondra (#26)

#29

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Robert Haas (#28)

#30

Robert Haas

robertmhaas@gmail.com

about 2 years ago

In reply to: Tomas Vondra (#25)

#31

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Robert Haas (#30)

#32

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Tomas Vondra (#31)

#33

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Thomas Munro (#32)

#34

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Thomas Munro (#32)

#35

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Tomas Vondra (#34)

#36

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Thomas Munro (#35)

#37

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Tomas Vondra (#36)

#38

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Tomas Vondra (#37)

#39

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Thomas Munro (#38)

#40

Jakub Wartak

jakub.wartak@enterprisedb.com

about 2 years ago

In reply to: Tomas Vondra (#37)

#41

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Jakub Wartak (#40)

#42

Jakub Wartak

jakub.wartak@enterprisedb.com

about 2 years ago

In reply to: Tomas Vondra (#41)

#43

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Jakub Wartak (#42)

#44

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Tomas Vondra (#43)

#45

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Tomas Vondra (#44)

#46

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 2 years ago

In reply to: Tomas Vondra (#45)

pg_upgrade --copy-file-range

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: