file cloning in pg_upgrade and CREATE DATABASE

Started by Peter Eisentrautalmost 8 years ago42 messages
#1Peter Eisentraut
peter.eisentraut@2ndquadrant.com
1 attachment(s)

Here is another attempt at implementing file cloning for pg_upgrade and
CREATE DATABASE. The idea is to take advantage of file systems that can
make copy-on-write clones, which would make the copy run much faster.
For pg_upgrade, this will give the performance of --link mode without
the associated drawbacks.

There have been patches proposed previously [0]/messages/by-id/513C0E7C.5080606@socialserve.com[1]/messages/by-id/20140213030731.GE4831@momjian.us -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services. The concerns there
were mainly that they required a Linux-specific ioctl() call and only
worked for Btrfs.

Some new things have happened since then:

- XFS has (optional) reflink support. This file system is probably more
widely used than Btrfs.

- Linux and glibc have a proper function to do this now.

- APFS on macOS supports file cloning.

So altogether this feature will be more widely usable and less ugly to
implement. Note, however, that you will currently need literally the
latest glibc release, so it probably won't be accessible right now
unless you are using Fedora 28 for example. (This is the
copy_file_range() function that had us recently rename the same function
in pg_rewind.)

Some example measurements:

6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
and APFS)

similar for a CREATE DATABASE from a large template

Even if you don't have a file system with cloning support, the special
library calls make copying faster. For example, on APFS, in this
example, an unpatched CREATE DATABASE takes 30 seconds, with the library
call (but without cloning) it takes 10 seconds.

For amusement/bewilderment, without the recent flush optimization on
APFS, this takes 2 minutes 30 seconds. I suppose this optimization will
now actually obsolete, since macOS will no longer hit that code.

[0]: /messages/by-id/513C0E7C.5080606@socialserve.com
/messages/by-id/513C0E7C.5080606@socialserve.com

[1]: /messages/by-id/20140213030731.GE4831@momjian.us -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
/messages/by-id/20140213030731.GE4831@momjian.us
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Use-file-cloning-in-pg_upgrade-and-CREATE-DATABASE.patchtext/plain; charset=UTF-8; name=0001-Use-file-cloning-in-pg_upgrade-and-CREATE-DATABASE.patch; x-mac-creator=0; x-mac-type=0Download
From 56b5b574f6d900d5eb4932be499cf3bae0e7ba86 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Tue, 20 Feb 2018 10:41:16 -0500
Subject: [PATCH] Use file cloning in pg_upgrade and CREATE DATABASE

For file copying in pg_upgrade and CREATE DATABASE, use special file
cloning calls if available.  This makes the copying faster and more
space efficient.  For pg_upgrade, this achieves speed similar to --link
mode without the associated drawbacks.

On Linux, use copy_file_range().  This supports file cloning
automatically on Btrfs and XFS (if formatted with reflink support).  On
macOS, use copyfile(), which supports file cloning on APFS.

Even on file systems without cloning/reflink support, this is faster
than the existing code, because it avoids copying the file contents out
of kernel space and allows the OS to apply other optimizations.
---
 configure                          |  2 +-
 configure.in                       |  2 +-
 doc/src/sgml/ref/pgupgrade.sgml    | 11 ++++++++
 src/backend/storage/file/copydir.c | 55 +++++++++++++++++++++++++++++++++-----
 src/bin/pg_upgrade/file.c          | 37 ++++++++++++++++++++++++-
 src/include/pg_config.h.in         |  6 +++++
 6 files changed, 104 insertions(+), 9 deletions(-)

diff --git a/configure b/configure
index 7dcca506f8..eb8b321723 100755
--- a/configure
+++ b/configure
@@ -13079,7 +13079,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copy_file_range copyfile dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 4d26034579..dfe3507b25 100644
--- a/configure.in
+++ b/configure.in
@@ -1425,7 +1425,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime copy_file_range copyfile dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 6dafb404a1..3873e71dd1 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -737,6 +737,17 @@ <title>Notes</title>
    is down.
   </para>
 
+  <para>
+   In PostgreSQL 11 and later, <application>pg_upgrade</application>
+   automatically uses efficient file cloning (also known as
+   <quote>reflinks</quote>) on some operating systems and file systems.  This
+   can result in near-instantaneous copying of the data files, giving the
+   speed advantages of <option>-k</option>/<option>--link</option> while
+   leaving the old cluster untouched.  At present, this is supported on Linux
+   (kernel 4.5 or later, glibc 2.27 or later) with Btrfs and XFS (on file
+   systems created with reflink support, which is not the default for XFS at
+   this writing), and on macOS with APFS.
+  </para>
  </refsect1>
 
  <refsect1>
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index ca6342db0d..cd6398d69a 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -21,6 +21,9 @@
 #include <fcntl.h>
 #include <unistd.h>
 #include <sys/stat.h>
+#ifdef HAVE_COPYFILE
+#include <copyfile.h>
+#endif
 
 #include "storage/copydir.h"
 #include "storage/fd.h"
@@ -74,7 +77,22 @@ copydir(char *fromdir, char *todir, bool recurse)
 				copydir(fromfile, tofile, true);
 		}
 		else if (S_ISREG(fst.st_mode))
+		{
+#ifdef HAVE_COPYFILE
+			if (copyfile(fromfile, tofile, NULL,
+#ifdef COPYFILE_CLONE
+						 COPYFILE_CLONE
+#else
+						 COPYFILE_DATA
+#endif
+					) < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not copy file \"%s\" to \"%s\": %m", fromfile, tofile)));
+#else
 			copy_file(fromfile, tofile);
+#endif
+		}
 	}
 	FreeDir(xldir);
 
@@ -126,12 +144,17 @@ copydir(char *fromdir, char *todir, bool recurse)
 void
 copy_file(char *fromfile, char *tofile)
 {
-	char	   *buffer;
 	int			srcfd;
 	int			dstfd;
+#ifdef HAVE_COPY_FILE_RANGE
+	struct stat stat;
+	size_t		len;
+#else
+	char	   *buffer;
 	int			nbytes;
 	off_t		offset;
 	off_t		flush_offset;
+#endif
 
 	/* Size of copy buffer (read and write requests) */
 #define COPY_BUF_SIZE (8 * BLCKSZ)
@@ -148,9 +171,6 @@ copy_file(char *fromfile, char *tofile)
 #define FLUSH_DISTANCE (1024 * 1024)
 #endif
 
-	/* Use palloc to ensure we get a maxaligned buffer */
-	buffer = palloc(COPY_BUF_SIZE);
-
 	/*
 	 * Open the files
 	 */
@@ -166,6 +186,28 @@ copy_file(char *fromfile, char *tofile)
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tofile)));
 
+#ifdef HAVE_COPY_FILE_RANGE
+	if (fstat(srcfd, &stat) < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not stat file \"%s\": %m", fromfile)));
+
+	len = stat.st_size;
+
+	do {
+		ssize_t ret = copy_file_range(srcfd, NULL, dstfd, NULL, len, 0);
+		if (ret < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not copy file \"%s\" to \"%s\": %m",
+							fromfile, tofile)));
+
+		len -= ret;
+	} while (len > 0);
+#else
+	/* Use palloc to ensure we get a maxaligned buffer */
+	buffer = palloc(COPY_BUF_SIZE);
+
 	/*
 	 * Do the data copying.
 	 */
@@ -213,12 +255,13 @@ copy_file(char *fromfile, char *tofile)
 	if (offset > flush_offset)
 		pg_flush_data(dstfd, flush_offset, offset - flush_offset);
 
+	pfree(buffer);
+#endif
+
 	if (CloseTransientFile(dstfd))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tofile)));
 
 	CloseTransientFile(srcfd);
-
-	pfree(buffer);
 }
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index f38bfacf02..f05fd9db9c 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -17,6 +17,9 @@
 
 #include <sys/stat.h>
 #include <fcntl.h>
+#ifdef HAVE_COPYFILE
+#include <copyfile.h>
+#endif
 
 
 #ifdef WIN32
@@ -34,10 +37,25 @@ void
 copyFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName)
 {
-#ifndef WIN32
+#ifdef HAVE_COPYFILE
+	if (copyfile(src, dst, NULL,
+#ifdef COPYFILE_CLONE
+				 COPYFILE_CLONE
+#else
+				 COPYFILE_DATA
+#endif
+			) < 0)
+		pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+				 schemaName, relName, src, dst, strerror(errno));
+#elif !defined(WIN32)
 	int			src_fd;
 	int			dest_fd;
+#ifdef HAVE_COPY_FILE_RANGE
+	struct stat stat;
+	size_t		len;
+#else
 	char	   *buffer;
+#endif
 
 	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
 		pg_fatal("error while copying relation \"%s.%s\": could not open file \"%s\": %s\n",
@@ -48,6 +66,22 @@ copyFile(const char *src, const char *dst,
 		pg_fatal("error while copying relation \"%s.%s\": could not create file \"%s\": %s\n",
 				 schemaName, relName, dst, strerror(errno));
 
+#ifdef HAVE_COPY_FILE_RANGE
+	if (fstat(src_fd, &stat) < 0)
+		pg_fatal("could not stat file \"%s\": %s",
+				 src, strerror(errno));
+
+	len = stat.st_size;
+
+	do {
+		ssize_t ret = copy_file_range(src_fd, NULL, dest_fd, NULL, len, 0);
+		if (ret < 0)
+			pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+					 schemaName, relName, src, dst, strerror(errno));
+
+		len -= ret;
+	} while (len > 0);
+#else
 	/* copy in fairly large chunks for best efficiency */
 #define COPY_BUF_SIZE (50 * BLCKSZ)
 
@@ -77,6 +111,7 @@ copyFile(const char *src, const char *dst,
 	}
 
 	pg_free(buffer);
+#endif
 	close(src_fd);
 	close(dest_fd);
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f98f773ff0..38e88e0395 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -114,6 +114,12 @@
 /* Define to 1 if your compiler handles computed gotos. */
 #undef HAVE_COMPUTED_GOTO
 
+/* Define to 1 if you have the `copyfile' function. */
+#undef HAVE_COPYFILE
+
+/* Define to 1 if you have the `copy_file_range' function. */
+#undef HAVE_COPY_FILE_RANGE
+
 /* Define to 1 if you have the <crtdefs.h> header file. */
 #undef HAVE_CRTDEFS_H
 

base-commit: 9a44a26b65d3d36867267624b76d3dea3dc4f6f6
-- 
2.16.2

#2Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#1)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Tue, Feb 20, 2018 at 10:00 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Some example measurements:

6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
and APFS)

similar for a CREATE DATABASE from a large template

Even if you don't have a file system with cloning support, the special
library calls make copying faster. For example, on APFS, in this
example, an unpatched CREATE DATABASE takes 30 seconds, with the library
call (but without cloning) it takes 10 seconds.

Nice results.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#1)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 02/21/2018 04:00 AM, Peter Eisentraut wrote:

...

Some example measurements:

6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
and APFS)

similar for a CREATE DATABASE from a large template

Nice improvement, of course. How does that affect performance on the
cloned database? If I understand this correctly, it essentially enables
CoW on the files, so what's the overhead on that? It'd be unfortunate to
speed up CREATE DATABASE only to get degraded performance later.

In any case, I find this interesting mainly for pg_upgrade use case. On
running systems I think the main issue with CREATE DATABASE is that it
forces a checkpoint.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tomas Vondra (#3)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 2/21/18 18:57, Tomas Vondra wrote:

Nice improvement, of course. How does that affect performance on the
cloned database? If I understand this correctly, it essentially enables
CoW on the files, so what's the overhead on that? It'd be unfortunate to
speed up CREATE DATABASE only to get degraded performance later.

I ran a little test (on APFS and XFS): Create a large (unlogged) table,
copy the database, then delete everything from the table in the copy.
That should need to CoW all the blocks. It has about the same
performance with cloning, possibly slightly faster.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#1)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Tue, Feb 20, 2018 at 10:00:04PM -0500, Peter Eisentraut wrote:

Some new things have happened since then:

- XFS has (optional) reflink support. This file system is probably more
widely used than Btrfs.

Btrfs is still in development, there are I think no many people who
would use it in production.

- Linux and glibc have a proper function to do this now.

- APFS on macOS supports file cloning.

So copyfile() is only part of macos? I am not able to find references
in FreeBSD, NetBSD or OpenBSD, but I may be missing something.

So altogether this feature will be more widely usable and less ugly to
implement. Note, however, that you will currently need literally the
latest glibc release, so it probably won't be accessible right now
unless you are using Fedora 28 for example. (This is the
copy_file_range() function that had us recently rename the same function
in pg_rewind.)

For reference, Debian SID is using glibc 2.27. ArchLinux is still on
2.26.

Some example measurements:

6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
and APFS)

Interesting. I'll try to test that on an XFS partition and see if I can
see a difference. For now I have just read through the patch.

+#ifdef HAVE_COPYFILE
+	if (copyfile(fromfile, tofile, NULL,
+#ifdef COPYFILE_CLONE
+			 COPYFILE_CLONE
+#else
+               	 COPYFILE_DATA
+#endif
+                       ) < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not copy file \"%s\" to \"%s\": %m", fromfile, tofile)));
+#else
        copy_file(fromfile, tofile);
+#endif

Any backend-side callers of copy_file() would not benefit from
copyfile() on OSX. Shouldn't all that handling be inside copy_file(),
similarly to what your patch actually does for pg_upgrade? I think that
you should also consider fcopyfile() instead of copyfile() as it works
directly on the file descriptors and share the same error handling as
the others.
--
Michael

#6Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#5)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Mon, Mar 19, 2018 at 04:06:36PM +0900, Michael Paquier wrote:

Any backend-side callers of copy_file() would not benefit from
copyfile() on OSX. Shouldn't all that handling be inside copy_file(),
similarly to what your patch actually does for pg_upgrade? I think that
you should also consider fcopyfile() instead of copyfile() as it works
directly on the file descriptors and share the same error handling as
the others.

Two other things I have noticed as well:
1) src/bin/pg_rewind/copy_fetch.c could benefit from similar speed-ups I
think when copying data from source to target using the local mode of
pg_rewind. This could really improve cases where new relations are
added after a promotion.
2) XLogFileCopy() uses a copy logic as well. For large segments things
could be improved, however we need to be careful about filling in the
end of segments with zeros.
--
Michael

#7Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#6)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Mon, Mar 19, 2018 at 04:14:15PM +0900, Michael Paquier wrote:

Two other things I have noticed as well:
1) src/bin/pg_rewind/copy_fetch.c could benefit from similar speed-ups I
think when copying data from source to target using the local mode of
pg_rewind. This could really improve cases where new relations are
added after a promotion.
2) XLogFileCopy() uses a copy logic as well. For large segments things
could be improved, however we need to be careful about filling in the
end of segments with zeros.

I have been thinking about this patch over the night, and here is a list
of bullet points which would be nice to tackle:
- Remove the current diff in copydir.
- Extend copy_file so as it is able to use fcopyfile.
- Move the work done in pg_upgrade into a common API which can as well
be used by pg_rewind as well. One place would be to have a
frontend-only API in src/common which does the leg work. I would
recommend working only on file descriptors as well for consistency with
copy_file_range.
- Add proper wait events for the backend calls. Those are missing for
copy_file_range and copyfile.
- For XLogFileCopy, the problem may be trickier as the tail of a segment
is filled with zeroes, so dropping it from the first version of the
patch sounds wiser.

Patch is switched as waiting on author, I have set myself as a
reviewer.

Thanks,
--
Michael

#8Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#7)
1 attachment(s)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 3/19/18 22:58, Michael Paquier wrote:

I have been thinking about this patch over the night, and here is a list
of bullet points which would be nice to tackle:
- Remove the current diff in copydir.

done

- Extend copy_file so as it is able to use fcopyfile.

fcopyfile() does not support cloning. (This is not documented.)

- Move the work done in pg_upgrade into a common API which can as well
be used by pg_rewind as well. One place would be to have a
frontend-only API in src/common which does the leg work. I would
recommend working only on file descriptors as well for consistency with
copy_file_range.

pg_upgrade copies files, whereas pg_rewind needs to copy file ranges.
So I don't think this is going to be a good match.

We could add support for using Linux copy_file_range() in pg_rewind, but
that would work a bit differently. I also don't have a good sense of
how to test the performance of that.

Another thing to think about is that we go through some trouble to
initialize new WAL files so that the disk space is fully allocated. If
we used file cloning calls in pg_rewind, that would potentially
invalidate some of that. At least, we'd have to think through this more
carefully.

- Add proper wait events for the backend calls. Those are missing for
copy_file_range and copyfile.

done

- For XLogFileCopy, the problem may be trickier as the tail of a segment
is filled with zeroes, so dropping it from the first version of the
patch sounds wiser.

Seems like a possible follow-on project. But see also under pg_rewind
above.

Another oddity is that pg_upgrade uses CopyFile() on Windows, but the
backend does not. The Git log shows that the backend used to use
CopyFile(), but that was then removed when the generic code was added,
but when pg_upgrade was imported, it came with the CopyFile() call.

I suspect the CopyFile() call can be quite a bit faster, so we should
consider adding it back in. Or if not, remove it from pg_upgrade.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v2-0001-Use-file-cloning-in-pg_upgrade-and-CREATE-DATABAS.patchtext/plain; charset=UTF-8; name=v2-0001-Use-file-cloning-in-pg_upgrade-and-CREATE-DATABAS.patch; x-mac-creator=0; x-mac-type=0Download
From bd8fe105f6b1c64098e344c4a7d0fc9c94d2e31d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Tue, 20 Mar 2018 10:21:47 -0400
Subject: [PATCH v2] Use file cloning in pg_upgrade and CREATE DATABASE

For file copying in pg_upgrade and CREATE DATABASE, use special file
cloning calls if available.  This makes the copying faster and more
space efficient.  For pg_upgrade, this achieves speed similar to --link
mode without the associated drawbacks.  Other backend users of copydir.c
will also take advantage of these changes, but the performance
improvement will probably not be as noticeable there.

On Linux, use copy_file_range().  This supports file cloning
automatically on Btrfs and XFS (if formatted with reflink support).  On
macOS, use copyfile(), which supports file cloning on APFS.

Even on file systems without cloning/reflink support, this is faster
than the existing code, because it avoids copying the file contents out
of kernel space and allows the OS to apply other optimizations.
---
 configure                          |  2 +-
 configure.in                       |  2 +-
 doc/src/sgml/monitoring.sgml       |  8 +++-
 doc/src/sgml/ref/pgupgrade.sgml    | 11 +++++
 src/backend/postmaster/pgstat.c    |  3 ++
 src/backend/storage/file/copydir.c | 84 ++++++++++++++++++++++++++++++--------
 src/backend/storage/file/reinit.c  |  3 +-
 src/bin/pg_upgrade/file.c          | 56 +++++++++++++++++++------
 src/include/pg_config.h.in         |  6 +++
 src/include/pgstat.h               |  1 +
 10 files changed, 141 insertions(+), 35 deletions(-)

diff --git a/configure b/configure
index 3943711283..f27c78f63a 100755
--- a/configure
+++ b/configure
@@ -13085,7 +13085,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copy_file_range copyfile dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 1babdbb755..7eb8673753 100644
--- a/configure.in
+++ b/configure.in
@@ -1428,7 +1428,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime copy_file_range copyfile dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3bc4de57d5..02029e81bc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1418,7 +1418,7 @@ <title><structname>wait_event</structname> Description</title>
          <entry>Waiting to apply WAL at recovery because it is delayed.</entry>
         </row>
         <row>
-         <entry morerows="65"><literal>IO</literal></entry>
+         <entry morerows="66"><literal>IO</literal></entry>
          <entry><literal>BufFileRead</literal></entry>
          <entry>Waiting for a read from a buffered file.</entry>
         </row>
@@ -1446,6 +1446,12 @@ <title><structname>wait_event</structname> Description</title>
          <entry><literal>ControlFileWriteUpdate</literal></entry>
          <entry>Waiting for a write to update the control file.</entry>
         </row>
+        <row>
+         <entry><literal>CopyFileCopy</literal></entry>
+         <entry>Waiting for a file copy operation (if the copying is done by
+         an operating system call rather than as separate read and write
+         operations).</entry>
+        </row>
         <row>
          <entry><literal>CopyFileRead</literal></entry>
          <entry>Waiting for a read during a file copy operation.</entry>
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 6dafb404a1..3873e71dd1 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -737,6 +737,17 @@ <title>Notes</title>
    is down.
   </para>
 
+  <para>
+   In PostgreSQL 11 and later, <application>pg_upgrade</application>
+   automatically uses efficient file cloning (also known as
+   <quote>reflinks</quote>) on some operating systems and file systems.  This
+   can result in near-instantaneous copying of the data files, giving the
+   speed advantages of <option>-k</option>/<option>--link</option> while
+   leaving the old cluster untouched.  At present, this is supported on Linux
+   (kernel 4.5 or later, glibc 2.27 or later) with Btrfs and XFS (on file
+   systems created with reflink support, which is not the default for XFS at
+   this writing), and on macOS with APFS.
+  </para>
  </refsect1>
 
  <refsect1>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 96ba216387..4feb3a5289 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3744,6 +3744,9 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
 			event_name = "ControlFileWriteUpdate";
 			break;
+		case WAIT_EVENT_COPY_FILE_COPY:
+			event_name = "CopyFileCopy";
+			break;
 		case WAIT_EVENT_COPY_FILE_READ:
 			event_name = "CopyFileRead";
 			break;
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index ca6342db0d..5aa9742b51 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -21,6 +21,9 @@
 #include <fcntl.h>
 #include <unistd.h>
 #include <sys/stat.h>
+#ifdef HAVE_COPYFILE
+#include <copyfile.h>
+#endif
 
 #include "storage/copydir.h"
 #include "storage/fd.h"
@@ -126,13 +129,71 @@ copydir(char *fromdir, char *todir, bool recurse)
 void
 copy_file(char *fromfile, char *tofile)
 {
-	char	   *buffer;
+#ifdef HAVE_COPYFILE
+	int			ret;
+
+	pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_COPY);
+	ret = copyfile(fromfile, tofile, NULL,
+#ifdef COPYFILE_CLONE
+				   COPYFILE_CLONE
+#else
+				   COPYFILE_DATA
+#endif
+		);
+	pgstat_report_wait_end();
+	if (ret < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not copy file \"%s\" to \"%s\": %m", fromfile, tofile)));
+#else
 	int			srcfd;
 	int			dstfd;
+#ifdef HAVE_COPY_FILE_RANGE
+	struct stat stat;
+	size_t		len;
+#else
+	char	   *buffer;
 	int			nbytes;
 	off_t		offset;
 	off_t		flush_offset;
+#endif
+
+	/*
+	 * Open the files
+	 */
+	srcfd = OpenTransientFile(fromfile, O_RDONLY | PG_BINARY);
+	if (srcfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", fromfile)));
+
+	dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	if (dstfd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", tofile)));
+
+#ifdef HAVE_COPY_FILE_RANGE
+	if (fstat(srcfd, &stat) < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not stat file \"%s\": %m", fromfile)));
+
+	len = stat.st_size;
 
+	do {
+		pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_COPY);
+		ssize_t ret = copy_file_range(srcfd, NULL, dstfd, NULL, len, 0);
+		pgstat_report_wait_end();
+		if (ret < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not copy file \"%s\" to \"%s\": %m",
+							fromfile, tofile)));
+
+		len -= ret;
+	} while (len > 0);
+#else
 	/* Size of copy buffer (read and write requests) */
 #define COPY_BUF_SIZE (8 * BLCKSZ)
 
@@ -151,21 +212,6 @@ copy_file(char *fromfile, char *tofile)
 	/* Use palloc to ensure we get a maxaligned buffer */
 	buffer = palloc(COPY_BUF_SIZE);
 
-	/*
-	 * Open the files
-	 */
-	srcfd = OpenTransientFile(fromfile, O_RDONLY | PG_BINARY);
-	if (srcfd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", fromfile)));
-
-	dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (dstfd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m", tofile)));
-
 	/*
 	 * Do the data copying.
 	 */
@@ -213,12 +259,14 @@ copy_file(char *fromfile, char *tofile)
 	if (offset > flush_offset)
 		pg_flush_data(dstfd, flush_offset, offset - flush_offset);
 
+	pfree(buffer);
+#endif
+
 	if (CloseTransientFile(dstfd))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tofile)));
 
 	CloseTransientFile(srcfd);
-
-	pfree(buffer);
+#endif
 }
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 92363ae6ad..2614b27307 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -314,8 +314,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 		FreeDir(dbspace_dir);
 
 		/*
-		 * copy_file() above has already called pg_flush_data() on the files
-		 * it created. Now we need to fsync those files, because a checkpoint
+		 * Now we need to fsync the copied files, because a checkpoint
 		 * won't do it for us while we're in recovery. We do this in a
 		 * separate pass to allow the kernel to perform all the flushes
 		 * (especially the metadata ones) at once.
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index f38bfacf02..4354b64b50 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -17,6 +17,9 @@
 
 #include <sys/stat.h>
 #include <fcntl.h>
+#ifdef HAVE_COPYFILE
+#include <copyfile.h>
+#endif
 
 
 #ifdef WIN32
@@ -34,10 +37,32 @@ void
 copyFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName)
 {
-#ifndef WIN32
+#if defined(HAVE_COPYFILE)
+	if (copyfile(src, dst, NULL,
+#ifdef COPYFILE_CLONE
+				 COPYFILE_CLONE
+#else
+				 COPYFILE_DATA
+#endif
+			) < 0)
+		pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+				 schemaName, relName, src, dst, strerror(errno));
+#elif defined(WIN32)
+	if (CopyFile(src, dst, true) == 0)
+	{
+		_dosmaperr(GetLastError());
+		pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+				 schemaName, relName, src, dst, strerror(errno));
+	}
+#else
 	int			src_fd;
 	int			dest_fd;
+#ifdef HAVE_COPY_FILE_RANGE
+	struct stat stat;
+	size_t		len;
+#else
 	char	   *buffer;
+#endif
 
 	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
 		pg_fatal("error while copying relation \"%s.%s\": could not open file \"%s\": %s\n",
@@ -48,6 +73,22 @@ copyFile(const char *src, const char *dst,
 		pg_fatal("error while copying relation \"%s.%s\": could not create file \"%s\": %s\n",
 				 schemaName, relName, dst, strerror(errno));
 
+#ifdef HAVE_COPY_FILE_RANGE
+	if (fstat(src_fd, &stat) < 0)
+		pg_fatal("could not stat file \"%s\": %s",
+				 src, strerror(errno));
+
+	len = stat.st_size;
+
+	do {
+		ssize_t ret = copy_file_range(src_fd, NULL, dest_fd, NULL, len, 0);
+		if (ret < 0)
+			pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+					 schemaName, relName, src, dst, strerror(errno));
+
+		len -= ret;
+	} while (len > 0);
+#else
 	/* copy in fairly large chunks for best efficiency */
 #define COPY_BUF_SIZE (50 * BLCKSZ)
 
@@ -77,19 +118,10 @@ copyFile(const char *src, const char *dst,
 	}
 
 	pg_free(buffer);
+#endif
 	close(src_fd);
 	close(dest_fd);
-
-#else							/* WIN32 */
-
-	if (CopyFile(src, dst, true) == 0)
-	{
-		_dosmaperr(GetLastError());
-		pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
-				 schemaName, relName, src, dst, strerror(errno));
-	}
-
-#endif							/* WIN32 */
+#endif
 }
 
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f98f773ff0..38e88e0395 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -114,6 +114,12 @@
 /* Define to 1 if your compiler handles computed gotos. */
 #undef HAVE_COMPUTED_GOTO
 
+/* Define to 1 if you have the `copyfile' function. */
+#undef HAVE_COPYFILE
+
+/* Define to 1 if you have the `copy_file_range' function. */
+#undef HAVE_COPY_FILE_RANGE
+
 /* Define to 1 if you have the <crtdefs.h> header file. */
 #undef HAVE_CRTDEFS_H
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f59239b..934bce0fa9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -863,6 +863,7 @@ typedef enum
 	WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
 	WAIT_EVENT_CONTROL_FILE_WRITE,
 	WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+	WAIT_EVENT_COPY_FILE_COPY,
 	WAIT_EVENT_COPY_FILE_READ,
 	WAIT_EVENT_COPY_FILE_WRITE,
 	WAIT_EVENT_DATA_FILE_EXTEND,

base-commit: 13c7c65ec900a30bcddcb27f5fd138dcdbc2ca2e
-- 
2.16.2

#9Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#8)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Tue, Mar 20, 2018 at 10:55:04AM -0400, Peter Eisentraut wrote:

On 3/19/18 22:58, Michael Paquier wrote:

- Extend copy_file so as it is able to use fcopyfile.

fcopyfile() does not support cloning. (This is not documented.)

You are right. I have been reading the documentation here to get an
idea as I don't have a macos system at hand:
https://www.unix.com/man-page/osx/3/fcopyfile/

However I have bumped into that:
http://www.openradar.me/30706426

Future versions will be visibly fixed.

- Move the work done in pg_upgrade into a common API which can as well
be used by pg_rewind as well. One place would be to have a
frontend-only API in src/common which does the leg work. I would
recommend working only on file descriptors as well for consistency with
copy_file_range.

pg_upgrade copies files, whereas pg_rewind needs to copy file ranges.
So I don't think this is going to be a good match.

We could add support for using Linux copy_file_range() in pg_rewind, but
that would work a bit differently. I also don't have a good sense of
how to test the performance of that.

One simple way to test that would be to limit the time it takes to scan
the WAL segments on the target so as the filemap is computed quickly,
and create many, say gigabyte-size relations on the promoted source
which will need to be copied from the source to the target.

Another thing to think about is that we go through some trouble to
initialize new WAL files so that the disk space is fully allocated. If
we used file cloning calls in pg_rewind, that would potentially
invalidate some of that. At least, we'd have to think through this more
carefully.

Agreed. Let's keep in mind such things but come with a sane, first cut
of this patch based on the time remaining in this commit fest.

- Add proper wait events for the backend calls. Those are missing for
copy_file_range and copyfile.

done

+         <entry><literal>CopyFileCopy</literal></entry>
+         <entry>Waiting for a file copy operation (if the copying is done by
+         an operating system call rather than as separate read and write
+         operations).</entry>
CopyFileCopy is... Redundant.  Perhaps CopyFileSystem or CopyFileRange?

- For XLogFileCopy, the problem may be trickier as the tail of a segment
is filled with zeroes, so dropping it from the first version of the
patch sounds wiser.

Seems like a possible follow-on project. But see also under pg_rewind
above.

No objections to do that in the future for both.

Another oddity is that pg_upgrade uses CopyFile() on Windows, but the
backend does not. The Git log shows that the backend used to use
CopyFile(), but that was then removed when the generic code was added,
but when pg_upgrade was imported, it came with the CopyFile() call.

You mean 558730ac, right?

I suspect the CopyFile() call can be quite a bit faster, so we should
consider adding it back in. Or if not, remove it from pg_upgrade.

Hm. The proposed patch also removes an important property of what
happens now in copy_file: the copied files are periodically synced to
avoid spamming the cache, so for some loads wouldn't this cause a
performance regression?

At least on Linux it is possible to rely on sync_file_range which is
called via pg_flush_data, so it seems to me that we ought to roughly
keep the loop working on FLUSH_DISTANCE, and replace the calls of
read/write by copy_file_range. copyfile is only able to do a complete
file copy, so we would also lose this property as well on Linux. Even
for Windows using CopyFile would be a step backwards for the backend.

pg_upgrade is different though as it copies files fully, so using both
copyfile and copy_file_range makes sense.

At the end, it seems to me that using copy_file_range has some values as
you save a set of read/write calls, but copyfile comes with its
limitations, which I think will cause side issues, so I would recommend
dropping it from a first cut of the patch for the backend.
--
Michael

#10Bruce Momjian
bruce@momjian.us
In reply to: Peter Eisentraut (#8)
Re: file cloning in pg_upgrade and CREATE DATABASE

I think this documentation change:

	+   leaving the old cluster untouched.  At present, this is supported on Linux
	                            ---------

would be better by changing "untouched" to "unmodified".

Also, it would be nice if users could easily know if pg_upgrade is going
to use COW or not because it might affect whether they choose --link or
not. Right now it seems unclear how a user would know. Can we have
pg_upgrade --check perhaps output something. Can we also have the
pg_upgrade status display indicate that too, e.g. change

Copying user relation files

to

Copying (copy-on-write) user relation files

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#11Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#9)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 3/21/18 22:38, Michael Paquier wrote:

At least on Linux it is possible to rely on sync_file_range which is
called via pg_flush_data, so it seems to me that we ought to roughly
keep the loop working on FLUSH_DISTANCE, and replace the calls of
read/write by copy_file_range. copyfile is only able to do a complete
file copy, so we would also lose this property as well on Linux.

I have shown earlier in the thread that copy_file_range in one go is
still better than doing it in pieces.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Bruce Momjian (#10)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 3/23/18 13:16, Bruce Momjian wrote:

Also, it would be nice if users could easily know if pg_upgrade is going
to use COW or not because it might affect whether they choose --link or
not. Right now it seems unclear how a user would know. Can we have
pg_upgrade --check perhaps output something. Can we also have the
pg_upgrade status display indicate that too, e.g. change

Copying user relation files

to

Copying (copy-on-write) user relation files

That would be nice, but we don't have a way to tell that, AFAICT.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#11)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Sun, Mar 25, 2018 at 09:33:38PM -0400, Peter Eisentraut wrote:

On 3/21/18 22:38, Michael Paquier wrote:

At least on Linux it is possible to rely on sync_file_range which is
called via pg_flush_data, so it seems to me that we ought to roughly
keep the loop working on FLUSH_DISTANCE, and replace the calls of
read/write by copy_file_range. copyfile is only able to do a complete
file copy, so we would also lose this property as well on Linux.

I have shown earlier in the thread that copy_file_range in one go is
still better than doing it in pieces.

f8c183a has introduced the optimization that your patch is removing,
which was discussed on this thread:
/messages/by-id/4B78906A.7020309@mark.mielke.cc
I am not much into the internals of copy_file_range, but isn't there a
risk to have a large range of blocks copied to discard potentially
useful blocks from the OS cache? That's what this patch makes me worry
about. Performance is good, but on a system where the OS cache is
heavily used for a set of hot blocks this could cause performance side
effects that I think we canot neglect.

Another thing is that 71d6d07 allowed a couple of database commands to
be more sensitive to interruptions. With large databases used as a base
template it seems to me that this would cause the interruptions to be
less responsive.
--
Michael

#14Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#13)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 3/26/18 02:15, Michael Paquier wrote:

f8c183a has introduced the optimization that your patch is removing,
which was discussed on this thread:
/messages/by-id/4B78906A.7020309@mark.mielke.cc

Note that that thread is from 2010 and talks about creation of a
database from the standard template being too slow on spinning rust,
because we fsync too often. I think we have moved well past that
problem size.

I have run some more tests on both macOS and Linux with ext4, and my
results are that the bigger the flush distance, the better. Before we
made the adjustments for APFS, we had a flush size of 64kB, now it's 1MB
and 32MB on macOS. In my tests, I see 256MB as the best across both
platforms, and not flushing early at all is only minimally worse.

You can measure this to death, and this obviously doesn't apply equally
on all systems and configurations, but clearly some of the old
assumptions from 8 years ago are no longer applicable.

I am not much into the internals of copy_file_range, but isn't there a
risk to have a large range of blocks copied to discard potentially
useful blocks from the OS cache? That's what this patch makes me worry
about. Performance is good, but on a system where the OS cache is
heavily used for a set of hot blocks this could cause performance side
effects that I think we canot neglect.

How would we go about assessing that? It's possible, but if
copy_file_range() really blows away all your in-use cache, that would be
surprising.

Another thing is that 71d6d07 allowed a couple of database commands to
be more sensitive to interruptions. With large databases used as a base
template it seems to me that this would cause the interruptions to be
less responsive.

The maximum file size that we copy is 1GB and that nowadays takes maybe
10 seconds. I think that would be an acceptable response time.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Peter Eisentraut (#14)
Re: file cloning in pg_upgrade and CREATE DATABASE

I think we have raised a number of interesting issues here which require
more deeper consideration. So I suggest to set this patch to Returned
with feedback.

Btw., I just learned that copy_file_range() only works on files on the
same device. So more arrangements will need to be made for that.

I have run some more tests on both macOS and Linux with ext4, and my> results are that the bigger the flush distance, the better. Before

we> made the adjustments for APFS, we had a flush size of 64kB, now it's
1MB> and 32MB on macOS. In my tests, I see 256MB as the best across
both> platforms, and not flushing early at all is only minimally worse.
Based on this, I suggest that we set the flush distance to 32MB on all
platforms. Not only is it faster, it avoids having different settings
on some platforms.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Peter Eisentraut (#12)
1 attachment(s)
Re: file cloning in pg_upgrade and CREATE DATABASE

I have made a major revision of this patch.

I have removed all the changes to CREATE DATABASE. That was too
contentious and we got lost in unrelated details there. The real
benefit is for pg_upgrade.

Another point was that for pg_upgrade use a user would like to know
beforehand whether reflinking would be used, which was not possible with
the copy_file_range() API. So here I have switched to using the ioctl()
call directly.

So the new interface is that pg_upgrade has a new option
--reflink={always,auto,never}. (This option name is adapted from GNU
cp.) From the documentation:

<para>
The setting <literal>always</literal> requires the use of relinks. If
they are not supported, the <application>pg_upgrade</application> run
will abort. Use this in production to limit the upgrade run time.
The setting <literal>auto</literal> uses reflinks when available,
otherwise it falls back to a normal copy. This is the default. The
setting <literal>never</literal> prevents use of reflinks and always
uses a normal copy. This can be useful to ensure that the upgraded
cluster has its disk space fully allocated and not shared with the old
cluster.
</para>

Also, pg_upgrade --check will check whether the selected option would work.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v3-0001-pg_upgrade-Allow-use-of-file-cloning.patchtext/plain; charset=UTF-8; name=v3-0001-pg_upgrade-Allow-use-of-file-cloning.patch; x-mac-creator=0; x-mac-type=0Download
From c39e40640e70e8fc4b90e762b985201a1ce9f912 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Tue, 5 Jun 2018 17:24:53 -0400
Subject: [PATCH v3] pg_upgrade: Allow use of file cloning

For file copying in pg_upgrade, allow using special file cloning calls
if available.  This makes the copying faster and more space efficient.
This achieves speed similar to --link mode without the associated
drawbacks.

Add an option --reflink to select whether file cloning is turned on,
off, or automatic.  Automatic is the default.

On Linux, file cloning is supported on Btrfs and XFS (if formatted with
reflink support).  On macOS, file cloning is supported on APFS.
---
 configure                        |   2 +-
 configure.in                     |   2 +-
 doc/src/sgml/ref/pgupgrade.sgml  |  33 +++++++++
 src/bin/pg_upgrade/check.c       |   2 +
 src/bin/pg_upgrade/file.c        | 123 +++++++++++++++++++++++++++++++
 src/bin/pg_upgrade/option.c      |  14 ++++
 src/bin/pg_upgrade/pg_upgrade.h  |  15 ++++
 src/bin/pg_upgrade/relfilenode.c |  31 +++++++-
 src/include/pg_config.h.in       |   3 +
 9 files changed, 220 insertions(+), 5 deletions(-)

diff --git a/configure b/configure
index 3d219c802b..a0eebf7462 100755
--- a/configure
+++ b/configure
@@ -14827,7 +14827,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 862d8b128d..73632bee91 100644
--- a/configure.in
+++ b/configure.in
@@ -1528,7 +1528,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime copyfile dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 6dafb404a1..01a426f714 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -182,6 +182,39 @@ <title>Options</title>
       <listitem><para>display version information, then exit</para></listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><literal><option>--reflink</option>={always|auto|never}</literal></term>
+      <listitem>
+       <para>
+        Determines whether <application>pg_upgrade</application>, when in copy
+        mode, should use efficient file cloning (also known as
+        <quote>reflinks</quote>) on some operating systems and file systems.
+        This can result in near-instantaneous copying of the data files,
+        giving the speed advantages of
+        <option>-k</option>/<option>--link</option> while leaving the old
+        cluster untouched.
+       </para>
+
+       <para>
+        The setting <literal>always</literal> requires the use of relinks.  If
+        they are not supported, the <application>pg_upgrade</application> run
+        will abort.  Use this in production to limit the upgrade run time.
+        The setting <literal>auto</literal> uses reflinks when available,
+        otherwise it falls back to a normal copy.  This is the default.  The
+        setting <literal>never</literal> prevents use of reflinks and always
+        uses a normal copy.  This can be useful to ensure that the upgraded
+        cluster has its disk space fully allocated and not shared with the old
+        cluster.
+       </para>
+
+       <para>
+        At present, reflinks are supported on Linux (kernel 4.5 or later) with
+        Btrfs and XFS (on file systems created with reflink support, which is
+        not the default for XFS at this writing), and on macOS with APFS.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>-?</option></term>
       <term><option>--help</option></term>
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 577db73f10..0d7a67539a 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -151,6 +151,8 @@ check_new_cluster(void)
 
 	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
 		check_hard_link();
+	else if (user_opts.transfer_mode == TRANSFER_MODE_COPY && user_opts.reflink_mode != REFLINK_NEVER)
+		check_reflink();
 
 	check_is_install_user(&new_cluster);
 
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index f68211aa20..7dd8106c39 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -18,6 +18,13 @@
 
 #include <sys/stat.h>
 #include <fcntl.h>
+#ifdef HAVE_COPYFILE
+#include <copyfile.h>
+#endif
+#ifdef __linux__
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#endif
 
 
 #ifdef WIN32
@@ -93,6 +100,68 @@ copyFile(const char *src, const char *dst,
 #endif							/* WIN32 */
 }
 
+/*
+ * cloneFile()
+ *
+ * Clones/reflinks a relation file from src to dst.
+ *
+ * schemaName/relName are relation's SQL name (used for error messages only).
+ *
+ * If unsupported_ok is true, then if the cloning fails because the OS or file
+ * system don't support it, don't error, instead return false.  Otherwise,
+ * true is returned.  Based on this, the caller can then try to call
+ * copyFile() instead, for example.
+ */
+bool
+cloneFile(const char *src, const char *dst,
+		  const char *schemaName, const char *relName,
+		  bool unsupported_ok)
+{
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(src, dst, NULL, COPYFILE_CLONE_FORCE) < 0)
+	{
+		if (unsupported_ok && errno == ENOTSUP)
+			return false;
+		else
+			pg_fatal("error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+					 schemaName, relName, src, dst, strerror(errno));
+	}
+	return true;
+#elif defined(__linux__) && defined(FICLONE)
+	int			src_fd;
+	int			dest_fd;
+
+	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+		pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n",
+				 schemaName, relName, src, strerror(errno));
+
+	if ((dest_fd = open(dst, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+						pg_file_create_mode)) < 0)
+		pg_fatal("error while cloning relation \"%s.%s\": could not create file \"%s\": %s\n",
+				 schemaName, relName, dst, strerror(errno));
+
+	if (ioctl(dest_fd, FICLONE, src_fd) < 0)
+	{
+		unlink(dst);
+		if (unsupported_ok && errno == EOPNOTSUPP)
+		{
+			close(src_fd);
+			close(dest_fd);
+			return false;
+		}
+		else
+			pg_fatal("error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+					 schemaName, relName, src, dst, strerror(errno));
+	}
+
+	close(src_fd);
+	close(dest_fd);
+	return true;
+#else
+	return false;
+#endif
+}
+
 
 /*
  * linkFile()
@@ -279,6 +348,60 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile,
 	close(src_fd);
 }
 
+void
+check_reflink(void)
+{
+	char		existing_file[MAXPGPATH];
+	char		new_link_file[MAXPGPATH];
+
+	snprintf(existing_file, sizeof(existing_file), "%s/PG_VERSION", old_cluster.pgdata);
+	snprintf(new_link_file, sizeof(new_link_file), "%s/PG_VERSION.reflinktest", new_cluster.pgdata);
+	unlink(new_link_file);		/* might fail */
+
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(existing_file, new_link_file, NULL, COPYFILE_CLONE_FORCE) < 0)
+	{
+		if (user_opts.reflink_mode == REFLINK_ALWAYS)
+			pg_fatal("could not clone file between old and new data directories: %s\n",
+					 strerror(errno));
+		else if (user_opts.check)
+			pg_log(PG_REPORT, "could not clone file between old and new data directories: %s\n",
+				   strerror(errno));
+	}
+#elif defined(__linux__) && defined(FICLONE)
+	{
+		int			src_fd;
+		int			dest_fd;
+
+		if ((src_fd = open(existing_file, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %s\n",
+					 existing_file, strerror(errno));
+
+		if ((dest_fd = open(new_link_file, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							pg_file_create_mode)) < 0)
+			pg_fatal("could not create file \"%s\": %s\n",
+					 new_link_file, strerror(errno));
+
+		if (ioctl(dest_fd, FICLONE, src_fd) < 0)
+		{
+			if (user_opts.reflink_mode == REFLINK_ALWAYS)
+				pg_fatal("could not clone file between old and new data directories: %s\n",
+						 strerror(errno));
+			else if (user_opts.check)
+				pg_log(PG_REPORT, "could not clone file between old and new data directories: %s\n",
+					   strerror(errno));
+		}
+
+		close(src_fd);
+		close(dest_fd);
+	}
+#else
+	pg_fatal("file cloning not supported on this platform\n");
+#endif
+
+	unlink(new_link_file);
+}
+
 void
 check_hard_link(void)
 {
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 9dbc9225a6..d52a1bcee3 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -53,6 +53,9 @@ parseCommandLine(int argc, char *argv[])
 		{"retain", no_argument, NULL, 'r'},
 		{"jobs", required_argument, NULL, 'j'},
 		{"verbose", no_argument, NULL, 'v'},
+
+		{"reflink", required_argument, NULL, 1},
+
 		{NULL, 0, NULL, 0}
 	};
 	int			option;			/* Command line option */
@@ -203,6 +206,17 @@ parseCommandLine(int argc, char *argv[])
 				log_opts.verbose = true;
 				break;
 
+			case 1:
+				if (strcmp(optarg, "always") == 0)
+					user_opts.reflink_mode = REFLINK_ALWAYS;
+				else if (strcmp(optarg, "auto") == 0)
+					user_opts.reflink_mode = REFLINK_AUTO;
+				else if (strcmp(optarg, "never") == 0)
+					user_opts.reflink_mode = REFLINK_NEVER;
+				else
+					pg_fatal("invalid reflink mode: %s\n", optarg);
+				break;
+
 			default:
 				pg_fatal("Try \"%s --help\" for more information.\n",
 						 os_info.progname);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 7e5e971294..9adfc87140 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -238,6 +238,16 @@ typedef enum
 	TRANSFER_MODE_LINK
 } transferMode;
 
+/*
+ * Enumeration to denote reflink modes
+ */
+typedef enum
+{
+	REFLINK_NEVER,
+	REFLINK_AUTO,
+	REFLINK_ALWAYS
+} reflinkMode;
+
 /*
  * Enumeration to denote pg_log modes
  */
@@ -297,6 +307,7 @@ typedef struct
 	bool		check;			/* true -> ask user for permission to make
 								 * changes */
 	transferMode transfer_mode; /* copy files or link them? */
+	reflinkMode	reflink_mode;
 	int			jobs;
 } UserOpts;
 
@@ -369,10 +380,14 @@ bool		pid_lock_file_exists(const char *datadir);
 
 void copyFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName);
+bool cloneFile(const char *src, const char *dst,
+		 const char *schemaName, const char *relName,
+		 bool unsupported_ok);
 void linkFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName);
 void rewriteVisibilityMap(const char *fromfile, const char *tofile,
 					 const char *schemaName, const char *relName);
+void		check_reflink(void);
 void		check_hard_link(void);
 
 /* fopen_priv() is no longer different from fopen() */
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index ed604f26ca..fc00cfdfae 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -252,9 +252,34 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 		}
 		else if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
 		{
-			pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
-				   old_file, new_file);
-			copyFile(old_file, new_file, map->nspname, map->relname);
+			if (user_opts.reflink_mode == REFLINK_ALWAYS)
+			{
+				pg_log(PG_VERBOSE, "cloning \"%s\" to \"%s\"\n",
+					   old_file, new_file);
+				cloneFile(old_file, new_file, map->nspname, map->relname, false);
+			}
+			else if (user_opts.reflink_mode == REFLINK_AUTO)
+			{
+				static bool		cloning_ok = true;
+
+				pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+					   old_file, new_file);
+				if (cloning_ok &&
+					!cloneFile(old_file, new_file, map->nspname, map->relname, true))
+				{
+					pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n");
+					cloning_ok = false;
+					copyFile(old_file, new_file, map->nspname, map->relname);
+				}
+				else
+					copyFile(old_file, new_file, map->nspname, map->relname);
+			}
+			else
+			{
+				pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+					   old_file, new_file);
+				copyFile(old_file, new_file, map->nspname, map->relname);
+			}
 		}
 		else
 		{
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 89b8804251..5a87a95d67 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -114,6 +114,9 @@
 /* Define to 1 if your compiler handles computed gotos. */
 #undef HAVE_COMPUTED_GOTO
 
+/* Define to 1 if you have the `copyfile' function. */
+#undef HAVE_COPYFILE
+
 /* Define to 1 if you have the <crtdefs.h> header file. */
 #undef HAVE_CRTDEFS_H
 

base-commit: 3f85c62d9e825eedd1315d249ef1ad793ca78ed4
-- 
2.17.1

#17Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#16)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Wed, Jun 6, 2018 at 11:58 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

--reflink={always,auto,never}. (This option name is adapted from GNU

...

The setting <literal>always</literal> requires the use of relinks. If

Is it supposed to be relinks or reflinks? The two lines above don't agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Robert Haas (#17)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 6/8/18 14:06, Robert Haas wrote:

Is it supposed to be relinks or reflinks? The two lines above don't agree.

It's supposed to be "reflinks". I'll fix that.

I have also used the more general term "cloning" in the documentation.
We can discuss which term we should use more.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#19Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Peter Eisentraut (#1)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Wed, Feb 21, 2018 at 4:00 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

- XFS has (optional) reflink support. This file system is probably more
widely used than Btrfs.

- Linux and glibc have a proper function to do this now.

- APFS on macOS supports file cloning.

TIL that Solaris 11.4 (closed) ZFS supports reflink() too. Sadly,
it's not in OpenZFS though I see numerous requests and discussions...
(Of course you can just clone the whole filesystem and then pg_upgrade
the clone in-place).

--
Thomas Munro
http://www.enterprisedb.com

#20Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Thomas Munro (#19)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 13.07.18 07:09, Thomas Munro wrote:

On Wed, Feb 21, 2018 at 4:00 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

- XFS has (optional) reflink support. This file system is probably more
widely used than Btrfs.

- Linux and glibc have a proper function to do this now.

- APFS on macOS supports file cloning.

TIL that Solaris 11.4 (closed) ZFS supports reflink() too. Sadly,
it's not in OpenZFS though I see numerous requests and discussions...

I look forward to your FreeBSD patch then. ;-)

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#21Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#20)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Fri, Jul 13, 2018 at 10:22:21AM +0200, Peter Eisentraut wrote:

On 13.07.18 07:09, Thomas Munro wrote:

TIL that Solaris 11.4 (closed) ZFS supports reflink() too. Sadly,
it's not in OpenZFS though I see numerous requests and discussions...

I look forward to your FreeBSD patch then. ;-)

+1.
--
Michael
#22Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#16)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Wed, Jun 06, 2018 at 11:58:14AM -0400, Peter Eisentraut wrote:

<para>
The setting <literal>always</literal> requires the use of relinks. If
they are not supported, the <application>pg_upgrade</application> run
will abort. Use this in production to limit the upgrade run time.
The setting <literal>auto</literal> uses reflinks when available,
otherwise it falls back to a normal copy. This is the default. The
setting <literal>never</literal> prevents use of reflinks and always
uses a normal copy. This can be useful to ensure that the upgraded
cluster has its disk space fully allocated and not shared with the old
cluster.
</para>

Hm... I am wondering if we actually want the "auto" mode where we make
the option smarter automatically. I am afraid of users trying to use it
and being surprised that there is no gain while they expected some. I
would rather switch that to an on/off switch, which defaults to "off",
or simply what is available now. huge_pages=try was a bad idea as the
result is not deterministic, so I would not have more of that...

Putting CloneFile and check_reflink in a location that other frontend
binaries could use would be nice, like pg_rewind.
--
Michael

#23Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#22)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 17.07.18 08:58, Michael Paquier wrote:

Hm... I am wondering if we actually want the "auto" mode where we make
the option smarter automatically. I am afraid of users trying to use it
and being surprised that there is no gain while they expected some. I
would rather switch that to an on/off switch, which defaults to "off",
or simply what is available now. huge_pages=try was a bad idea as the
result is not deterministic, so I would not have more of that...

Why do you think that was a bad idea? Doing the best possible job by
default without requiring explicit configuration in each case seems like
an attractive feature.

Putting CloneFile and check_reflink in a location that other frontend
binaries could use would be nice, like pg_rewind.

This could be done in subsequent patches, but the previous iteration of
this patch for CREATE DATABASE integration already showed that each of
those cases needs separate consideration.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Peter Eisentraut (#23)
1 attachment(s)
Re: file cloning in pg_upgrade and CREATE DATABASE

rebased patch, no functionality changes

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v4-0001-pg_upgrade-Allow-use-of-file-cloning.patchtext/plain; charset=UTF-8; name=v4-0001-pg_upgrade-Allow-use-of-file-cloning.patch; x-mac-creator=0; x-mac-type=0Download
From 5e0f60e9cf182063f2f711251430d79282be1f93 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Sat, 1 Sep 2018 07:05:02 +0200
Subject: [PATCH v4] pg_upgrade: Allow use of file cloning

For file copying in pg_upgrade, allow using special file cloning calls
if available.  This makes the copying faster and more space efficient.
This achieves speed similar to --link mode without the associated
drawbacks.

Add an option --reflink to select whether file cloning is turned on,
off, or automatic.  Automatic is the default.

On Linux, file cloning is supported on Btrfs and XFS (if formatted with
reflink support).  On macOS, file cloning is supported on APFS.
---
 configure                        |   2 +-
 configure.in                     |   2 +-
 doc/src/sgml/ref/pgupgrade.sgml  |  33 +++++++++
 src/bin/pg_upgrade/check.c       |   2 +
 src/bin/pg_upgrade/file.c        | 123 +++++++++++++++++++++++++++++++
 src/bin/pg_upgrade/option.c      |  14 ++++
 src/bin/pg_upgrade/pg_upgrade.h  |  15 ++++
 src/bin/pg_upgrade/relfilenode.c |  31 +++++++-
 src/include/pg_config.h.in       |   3 +
 9 files changed, 220 insertions(+), 5 deletions(-)

diff --git a/configure b/configure
index dd94c5bbab..a324dd9ff7 100755
--- a/configure
+++ b/configure
@@ -15060,7 +15060,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 3280afa0da..614262fc52 100644
--- a/configure.in
+++ b/configure.in
@@ -1544,7 +1544,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime copyfile dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index d51146d641..d994218c44 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -182,6 +182,39 @@ <title>Options</title>
       <listitem><para>display version information, then exit</para></listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><literal><option>--reflink</option>={always|auto|never}</literal></term>
+      <listitem>
+       <para>
+        Determines whether <application>pg_upgrade</application>, when in copy
+        mode, should use efficient file cloning (also known as
+        <quote>reflinks</quote>) on some operating systems and file systems.
+        This can result in near-instantaneous copying of the data files,
+        giving the speed advantages of
+        <option>-k</option>/<option>--link</option> while leaving the old
+        cluster untouched.
+       </para>
+
+       <para>
+        The setting <literal>always</literal> requires the use of reflinks.  If
+        they are not supported, the <application>pg_upgrade</application> run
+        will abort.  Use this in production to limit the upgrade run time.
+        The setting <literal>auto</literal> uses reflinks when available,
+        otherwise it falls back to a normal copy.  This is the default.  The
+        setting <literal>never</literal> prevents use of reflinks and always
+        uses a normal copy.  This can be useful to ensure that the upgraded
+        cluster has its disk space fully allocated and not shared with the old
+        cluster.
+       </para>
+
+       <para>
+        At present, reflinks are supported on Linux (kernel 4.5 or later) with
+        Btrfs and XFS (on file systems created with reflink support, which is
+        not the default for XFS at this writing), and on macOS with APFS.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>-?</option></term>
       <term><option>--help</option></term>
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 5a78d603dc..eb1f18180a 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -151,6 +151,8 @@ check_new_cluster(void)
 
 	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
 		check_hard_link();
+	else if (user_opts.transfer_mode == TRANSFER_MODE_COPY && user_opts.reflink_mode != REFLINK_NEVER)
+		check_reflink();
 
 	check_is_install_user(&new_cluster);
 
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index f68211aa20..7dd8106c39 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -18,6 +18,13 @@
 
 #include <sys/stat.h>
 #include <fcntl.h>
+#ifdef HAVE_COPYFILE
+#include <copyfile.h>
+#endif
+#ifdef __linux__
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#endif
 
 
 #ifdef WIN32
@@ -93,6 +100,68 @@ copyFile(const char *src, const char *dst,
 #endif							/* WIN32 */
 }
 
+/*
+ * cloneFile()
+ *
+ * Clones/reflinks a relation file from src to dst.
+ *
+ * schemaName/relName are relation's SQL name (used for error messages only).
+ *
+ * If unsupported_ok is true, then if the cloning fails because the OS or file
+ * system don't support it, don't error, instead return false.  Otherwise,
+ * true is returned.  Based on this, the caller can then try to call
+ * copyFile() instead, for example.
+ */
+bool
+cloneFile(const char *src, const char *dst,
+		  const char *schemaName, const char *relName,
+		  bool unsupported_ok)
+{
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(src, dst, NULL, COPYFILE_CLONE_FORCE) < 0)
+	{
+		if (unsupported_ok && errno == ENOTSUP)
+			return false;
+		else
+			pg_fatal("error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+					 schemaName, relName, src, dst, strerror(errno));
+	}
+	return true;
+#elif defined(__linux__) && defined(FICLONE)
+	int			src_fd;
+	int			dest_fd;
+
+	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+		pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n",
+				 schemaName, relName, src, strerror(errno));
+
+	if ((dest_fd = open(dst, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+						pg_file_create_mode)) < 0)
+		pg_fatal("error while cloning relation \"%s.%s\": could not create file \"%s\": %s\n",
+				 schemaName, relName, dst, strerror(errno));
+
+	if (ioctl(dest_fd, FICLONE, src_fd) < 0)
+	{
+		unlink(dst);
+		if (unsupported_ok && errno == EOPNOTSUPP)
+		{
+			close(src_fd);
+			close(dest_fd);
+			return false;
+		}
+		else
+			pg_fatal("error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+					 schemaName, relName, src, dst, strerror(errno));
+	}
+
+	close(src_fd);
+	close(dest_fd);
+	return true;
+#else
+	return false;
+#endif
+}
+
 
 /*
  * linkFile()
@@ -279,6 +348,60 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile,
 	close(src_fd);
 }
 
+void
+check_reflink(void)
+{
+	char		existing_file[MAXPGPATH];
+	char		new_link_file[MAXPGPATH];
+
+	snprintf(existing_file, sizeof(existing_file), "%s/PG_VERSION", old_cluster.pgdata);
+	snprintf(new_link_file, sizeof(new_link_file), "%s/PG_VERSION.reflinktest", new_cluster.pgdata);
+	unlink(new_link_file);		/* might fail */
+
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(existing_file, new_link_file, NULL, COPYFILE_CLONE_FORCE) < 0)
+	{
+		if (user_opts.reflink_mode == REFLINK_ALWAYS)
+			pg_fatal("could not clone file between old and new data directories: %s\n",
+					 strerror(errno));
+		else if (user_opts.check)
+			pg_log(PG_REPORT, "could not clone file between old and new data directories: %s\n",
+				   strerror(errno));
+	}
+#elif defined(__linux__) && defined(FICLONE)
+	{
+		int			src_fd;
+		int			dest_fd;
+
+		if ((src_fd = open(existing_file, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %s\n",
+					 existing_file, strerror(errno));
+
+		if ((dest_fd = open(new_link_file, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							pg_file_create_mode)) < 0)
+			pg_fatal("could not create file \"%s\": %s\n",
+					 new_link_file, strerror(errno));
+
+		if (ioctl(dest_fd, FICLONE, src_fd) < 0)
+		{
+			if (user_opts.reflink_mode == REFLINK_ALWAYS)
+				pg_fatal("could not clone file between old and new data directories: %s\n",
+						 strerror(errno));
+			else if (user_opts.check)
+				pg_log(PG_REPORT, "could not clone file between old and new data directories: %s\n",
+					   strerror(errno));
+		}
+
+		close(src_fd);
+		close(dest_fd);
+	}
+#else
+	pg_fatal("file cloning not supported on this platform\n");
+#endif
+
+	unlink(new_link_file);
+}
+
 void
 check_hard_link(void)
 {
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 9dbc9225a6..d52a1bcee3 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -53,6 +53,9 @@ parseCommandLine(int argc, char *argv[])
 		{"retain", no_argument, NULL, 'r'},
 		{"jobs", required_argument, NULL, 'j'},
 		{"verbose", no_argument, NULL, 'v'},
+
+		{"reflink", required_argument, NULL, 1},
+
 		{NULL, 0, NULL, 0}
 	};
 	int			option;			/* Command line option */
@@ -203,6 +206,17 @@ parseCommandLine(int argc, char *argv[])
 				log_opts.verbose = true;
 				break;
 
+			case 1:
+				if (strcmp(optarg, "always") == 0)
+					user_opts.reflink_mode = REFLINK_ALWAYS;
+				else if (strcmp(optarg, "auto") == 0)
+					user_opts.reflink_mode = REFLINK_AUTO;
+				else if (strcmp(optarg, "never") == 0)
+					user_opts.reflink_mode = REFLINK_NEVER;
+				else
+					pg_fatal("invalid reflink mode: %s\n", optarg);
+				break;
+
 			default:
 				pg_fatal("Try \"%s --help\" for more information.\n",
 						 os_info.progname);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index f83a3eeb67..16eb34e14c 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -238,6 +238,16 @@ typedef enum
 	TRANSFER_MODE_LINK
 } transferMode;
 
+/*
+ * Enumeration to denote reflink modes
+ */
+typedef enum
+{
+	REFLINK_NEVER,
+	REFLINK_AUTO,
+	REFLINK_ALWAYS
+} reflinkMode;
+
 /*
  * Enumeration to denote pg_log modes
  */
@@ -297,6 +307,7 @@ typedef struct
 	bool		check;			/* true -> ask user for permission to make
 								 * changes */
 	transferMode transfer_mode; /* copy files or link them? */
+	reflinkMode	reflink_mode;
 	int			jobs;
 } UserOpts;
 
@@ -374,10 +385,14 @@ bool		pid_lock_file_exists(const char *datadir);
 
 void copyFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName);
+bool cloneFile(const char *src, const char *dst,
+		 const char *schemaName, const char *relName,
+		 bool unsupported_ok);
 void linkFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName);
 void rewriteVisibilityMap(const char *fromfile, const char *tofile,
 					 const char *schemaName, const char *relName);
+void		check_reflink(void);
 void		check_hard_link(void);
 
 /* fopen_priv() is no longer different from fopen() */
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index ed604f26ca..fc00cfdfae 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -252,9 +252,34 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 		}
 		else if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
 		{
-			pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
-				   old_file, new_file);
-			copyFile(old_file, new_file, map->nspname, map->relname);
+			if (user_opts.reflink_mode == REFLINK_ALWAYS)
+			{
+				pg_log(PG_VERBOSE, "cloning \"%s\" to \"%s\"\n",
+					   old_file, new_file);
+				cloneFile(old_file, new_file, map->nspname, map->relname, false);
+			}
+			else if (user_opts.reflink_mode == REFLINK_AUTO)
+			{
+				static bool		cloning_ok = true;
+
+				pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+					   old_file, new_file);
+				if (cloning_ok &&
+					!cloneFile(old_file, new_file, map->nspname, map->relname, true))
+				{
+					pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n");
+					cloning_ok = false;
+					copyFile(old_file, new_file, map->nspname, map->relname);
+				}
+				else
+					copyFile(old_file, new_file, map->nspname, map->relname);
+			}
+			else
+			{
+				pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+					   old_file, new_file);
+				copyFile(old_file, new_file, map->nspname, map->relname);
+			}
 		}
 		else
 		{
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 347d5b56dc..5356d9c418 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -114,6 +114,9 @@
 /* Define to 1 if your compiler handles computed gotos. */
 #undef HAVE_COMPUTED_GOTO
 
+/* Define to 1 if you have the `copyfile' function. */
+#undef HAVE_COPYFILE
+
 /* Define to 1 if you have the <crtdefs.h> header file. */
 #undef HAVE_CRTDEFS_H
 

base-commit: 5758685c9f1a51b0a77ae8f415a57670941f2c4b
-- 
2.18.0

#25Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#24)
Re: file cloning in pg_upgrade and CREATE DATABASE

Hi Peter,

On Sat, Sep 01, 2018 at 07:28:37AM +0200, Peter Eisentraut wrote:

rebased patch, no functionality changes

Could you rebase once again? I am going through the patch and wanted to
test pg_upgrade on Linux with XFS, but it does not apply anymore.

Thanks,
--
Michael

#26Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#25)
1 attachment(s)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 26/09/2018 08:44, Michael Paquier wrote:

On Sat, Sep 01, 2018 at 07:28:37AM +0200, Peter Eisentraut wrote:

rebased patch, no functionality changes

Could you rebase once again? I am going through the patch and wanted to
test pg_upgrade on Linux with XFS, but it does not apply anymore.

attached

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v5-0001-pg_upgrade-Allow-use-of-file-cloning.patchtext/plain; charset=UTF-8; name=v5-0001-pg_upgrade-Allow-use-of-file-cloning.patch; x-mac-creator=0; x-mac-type=0Download
From 9a58fc2589e50d69b4b158ea5e8f3898483290d0 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Thu, 27 Sep 2018 22:42:33 +0200
Subject: [PATCH v5] pg_upgrade: Allow use of file cloning

For file copying in pg_upgrade, allow using special file cloning calls
if available.  This makes the copying faster and more space efficient.
This achieves speed similar to --link mode without the associated
drawbacks.

Add an option --reflink to select whether file cloning is turned on,
off, or automatic.  Automatic is the default.

On Linux, file cloning is supported on Btrfs and XFS (if formatted with
reflink support).  On macOS, file cloning is supported on APFS.
---
 configure                        |   2 +-
 configure.in                     |   2 +-
 doc/src/sgml/ref/pgupgrade.sgml  |  33 +++++++++
 src/bin/pg_upgrade/check.c       |   2 +
 src/bin/pg_upgrade/file.c        | 123 +++++++++++++++++++++++++++++++
 src/bin/pg_upgrade/option.c      |  14 ++++
 src/bin/pg_upgrade/pg_upgrade.h  |  15 ++++
 src/bin/pg_upgrade/relfilenode.c |  31 +++++++-
 src/include/pg_config.h.in       |   3 +
 9 files changed, 220 insertions(+), 5 deletions(-)

diff --git a/configure b/configure
index 6414ec1ea6..ae6f1a2e17 100755
--- a/configure
+++ b/configure
@@ -15100,7 +15100,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 158d5a1ac8..265faf1b99 100644
--- a/configure.in
+++ b/configure.in
@@ -1571,7 +1571,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
+AC_CHECK_FUNCS([cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimes wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index d51146d641..d994218c44 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -182,6 +182,39 @@ <title>Options</title>
       <listitem><para>display version information, then exit</para></listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><literal><option>--reflink</option>={always|auto|never}</literal></term>
+      <listitem>
+       <para>
+        Determines whether <application>pg_upgrade</application>, when in copy
+        mode, should use efficient file cloning (also known as
+        <quote>reflinks</quote>) on some operating systems and file systems.
+        This can result in near-instantaneous copying of the data files,
+        giving the speed advantages of
+        <option>-k</option>/<option>--link</option> while leaving the old
+        cluster untouched.
+       </para>
+
+       <para>
+        The setting <literal>always</literal> requires the use of reflinks.  If
+        they are not supported, the <application>pg_upgrade</application> run
+        will abort.  Use this in production to limit the upgrade run time.
+        The setting <literal>auto</literal> uses reflinks when available,
+        otherwise it falls back to a normal copy.  This is the default.  The
+        setting <literal>never</literal> prevents use of reflinks and always
+        uses a normal copy.  This can be useful to ensure that the upgraded
+        cluster has its disk space fully allocated and not shared with the old
+        cluster.
+       </para>
+
+       <para>
+        At present, reflinks are supported on Linux (kernel 4.5 or later) with
+        Btrfs and XFS (on file systems created with reflink support, which is
+        not the default for XFS at this writing), and on macOS with APFS.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>-?</option></term>
       <term><option>--help</option></term>
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 5a78d603dc..eb1f18180a 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -151,6 +151,8 @@ check_new_cluster(void)
 
 	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
 		check_hard_link();
+	else if (user_opts.transfer_mode == TRANSFER_MODE_COPY && user_opts.reflink_mode != REFLINK_NEVER)
+		check_reflink();
 
 	check_is_install_user(&new_cluster);
 
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index c27cc93dc2..2e864cd6bb 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -18,6 +18,13 @@
 
 #include <sys/stat.h>
 #include <fcntl.h>
+#ifdef HAVE_COPYFILE
+#include <copyfile.h>
+#endif
+#ifdef __linux__
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#endif
 
 
 #ifdef WIN32
@@ -93,6 +100,68 @@ copyFile(const char *src, const char *dst,
 #endif							/* WIN32 */
 }
 
+/*
+ * cloneFile()
+ *
+ * Clones/reflinks a relation file from src to dst.
+ *
+ * schemaName/relName are relation's SQL name (used for error messages only).
+ *
+ * If unsupported_ok is true, then if the cloning fails because the OS or file
+ * system don't support it, don't error, instead return false.  Otherwise,
+ * true is returned.  Based on this, the caller can then try to call
+ * copyFile() instead, for example.
+ */
+bool
+cloneFile(const char *src, const char *dst,
+		  const char *schemaName, const char *relName,
+		  bool unsupported_ok)
+{
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(src, dst, NULL, COPYFILE_CLONE_FORCE) < 0)
+	{
+		if (unsupported_ok && errno == ENOTSUP)
+			return false;
+		else
+			pg_fatal("error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+					 schemaName, relName, src, dst, strerror(errno));
+	}
+	return true;
+#elif defined(__linux__) && defined(FICLONE)
+	int			src_fd;
+	int			dest_fd;
+
+	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+		pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n",
+				 schemaName, relName, src, strerror(errno));
+
+	if ((dest_fd = open(dst, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+						pg_file_create_mode)) < 0)
+		pg_fatal("error while cloning relation \"%s.%s\": could not create file \"%s\": %s\n",
+				 schemaName, relName, dst, strerror(errno));
+
+	if (ioctl(dest_fd, FICLONE, src_fd) < 0)
+	{
+		unlink(dst);
+		if (unsupported_ok && errno == EOPNOTSUPP)
+		{
+			close(src_fd);
+			close(dest_fd);
+			return false;
+		}
+		else
+			pg_fatal("error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+					 schemaName, relName, src, dst, strerror(errno));
+	}
+
+	close(src_fd);
+	close(dest_fd);
+	return true;
+#else
+	return false;
+#endif
+}
+
 
 /*
  * linkFile()
@@ -270,6 +339,60 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile,
 	close(src_fd);
 }
 
+void
+check_reflink(void)
+{
+	char		existing_file[MAXPGPATH];
+	char		new_link_file[MAXPGPATH];
+
+	snprintf(existing_file, sizeof(existing_file), "%s/PG_VERSION", old_cluster.pgdata);
+	snprintf(new_link_file, sizeof(new_link_file), "%s/PG_VERSION.reflinktest", new_cluster.pgdata);
+	unlink(new_link_file);		/* might fail */
+
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(existing_file, new_link_file, NULL, COPYFILE_CLONE_FORCE) < 0)
+	{
+		if (user_opts.reflink_mode == REFLINK_ALWAYS)
+			pg_fatal("could not clone file between old and new data directories: %s\n",
+					 strerror(errno));
+		else if (user_opts.check)
+			pg_log(PG_REPORT, "could not clone file between old and new data directories: %s\n",
+				   strerror(errno));
+	}
+#elif defined(__linux__) && defined(FICLONE)
+	{
+		int			src_fd;
+		int			dest_fd;
+
+		if ((src_fd = open(existing_file, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %s\n",
+					 existing_file, strerror(errno));
+
+		if ((dest_fd = open(new_link_file, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							pg_file_create_mode)) < 0)
+			pg_fatal("could not create file \"%s\": %s\n",
+					 new_link_file, strerror(errno));
+
+		if (ioctl(dest_fd, FICLONE, src_fd) < 0)
+		{
+			if (user_opts.reflink_mode == REFLINK_ALWAYS)
+				pg_fatal("could not clone file between old and new data directories: %s\n",
+						 strerror(errno));
+			else if (user_opts.check)
+				pg_log(PG_REPORT, "could not clone file between old and new data directories: %s\n",
+					   strerror(errno));
+		}
+
+		close(src_fd);
+		close(dest_fd);
+	}
+#else
+	pg_fatal("file cloning not supported on this platform\n");
+#endif
+
+	unlink(new_link_file);
+}
+
 void
 check_hard_link(void)
 {
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 9dbc9225a6..d52a1bcee3 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -53,6 +53,9 @@ parseCommandLine(int argc, char *argv[])
 		{"retain", no_argument, NULL, 'r'},
 		{"jobs", required_argument, NULL, 'j'},
 		{"verbose", no_argument, NULL, 'v'},
+
+		{"reflink", required_argument, NULL, 1},
+
 		{NULL, 0, NULL, 0}
 	};
 	int			option;			/* Command line option */
@@ -203,6 +206,17 @@ parseCommandLine(int argc, char *argv[])
 				log_opts.verbose = true;
 				break;
 
+			case 1:
+				if (strcmp(optarg, "always") == 0)
+					user_opts.reflink_mode = REFLINK_ALWAYS;
+				else if (strcmp(optarg, "auto") == 0)
+					user_opts.reflink_mode = REFLINK_AUTO;
+				else if (strcmp(optarg, "never") == 0)
+					user_opts.reflink_mode = REFLINK_NEVER;
+				else
+					pg_fatal("invalid reflink mode: %s\n", optarg);
+				break;
+
 			default:
 				pg_fatal("Try \"%s --help\" for more information.\n",
 						 os_info.progname);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index f83a3eeb67..16eb34e14c 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -238,6 +238,16 @@ typedef enum
 	TRANSFER_MODE_LINK
 } transferMode;
 
+/*
+ * Enumeration to denote reflink modes
+ */
+typedef enum
+{
+	REFLINK_NEVER,
+	REFLINK_AUTO,
+	REFLINK_ALWAYS
+} reflinkMode;
+
 /*
  * Enumeration to denote pg_log modes
  */
@@ -297,6 +307,7 @@ typedef struct
 	bool		check;			/* true -> ask user for permission to make
 								 * changes */
 	transferMode transfer_mode; /* copy files or link them? */
+	reflinkMode	reflink_mode;
 	int			jobs;
 } UserOpts;
 
@@ -374,10 +385,14 @@ bool		pid_lock_file_exists(const char *datadir);
 
 void copyFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName);
+bool cloneFile(const char *src, const char *dst,
+		 const char *schemaName, const char *relName,
+		 bool unsupported_ok);
 void linkFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName);
 void rewriteVisibilityMap(const char *fromfile, const char *tofile,
 					 const char *schemaName, const char *relName);
+void		check_reflink(void);
 void		check_hard_link(void);
 
 /* fopen_priv() is no longer different from fopen() */
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index ed604f26ca..fc00cfdfae 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -252,9 +252,34 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 		}
 		else if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
 		{
-			pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
-				   old_file, new_file);
-			copyFile(old_file, new_file, map->nspname, map->relname);
+			if (user_opts.reflink_mode == REFLINK_ALWAYS)
+			{
+				pg_log(PG_VERBOSE, "cloning \"%s\" to \"%s\"\n",
+					   old_file, new_file);
+				cloneFile(old_file, new_file, map->nspname, map->relname, false);
+			}
+			else if (user_opts.reflink_mode == REFLINK_AUTO)
+			{
+				static bool		cloning_ok = true;
+
+				pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+					   old_file, new_file);
+				if (cloning_ok &&
+					!cloneFile(old_file, new_file, map->nspname, map->relname, true))
+				{
+					pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n");
+					cloning_ok = false;
+					copyFile(old_file, new_file, map->nspname, map->relname);
+				}
+				else
+					copyFile(old_file, new_file, map->nspname, map->relname);
+			}
+			else
+			{
+				pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+					   old_file, new_file);
+				copyFile(old_file, new_file, map->nspname, map->relname);
+			}
 		}
 		else
 		{
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 90dda8ea05..2c57e31dcd 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -114,6 +114,9 @@
 /* Define to 1 if your compiler handles computed gotos. */
 #undef HAVE_COMPUTED_GOTO
 
+/* Define to 1 if you have the `copyfile' function. */
+#undef HAVE_COPYFILE
+
 /* Define to 1 if you have the <crtdefs.h> header file. */
 #undef HAVE_CRTDEFS_H
 

base-commit: 27e082b0c6e564facfbf54b56090fdcc4bf44cca
-- 
2.19.0

#27Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#26)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Thu, Sep 27, 2018 at 11:10:08PM +0200, Peter Eisentraut wrote:

On 26/09/2018 08:44, Michael Paquier wrote:

Could you rebase once again? I am going through the patch and wanted to
test pg_upgrade on Linux with XFS, but it does not apply anymore.

attached

Thanks for the rebase. At the end I got my hands on only an APFS using
a mac. I ran a test with an instance holding a database with pgbench at
scaling factor 500, which gives close to 6.5GB. The results are nice on
my laptop:
- --reflink=never runs in 15s
- --reflink=always runs in 4s
So that's a very nice gain!

+    static bool     cloning_ok = true;
+
+    pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+           old_file, new_file);
+    if (cloning_ok &&
+        !cloneFile(old_file, new_file, map->nspname, map->relname, true))
+    {
+        pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n");
+        cloning_ok = false;
+        copyFile(old_file, new_file, map->nspname, map->relname);
+    }
+    else
+        copyFile(old_file, new_file, map->nspname, map->relname);

This part overlaps with the job that check_reflink() already does.
Wouldn't it be more simple to have check_reflink do a one-time check
with the automatic mode, enforcing to REFLINK_NEVER if cloning test did
not pass when REFLINK_AUTO is used? This would simplify
transfer_relfile().

The --help output of pg_upgrade has not been updated.

I am not a fan of the --reflink interface to be honest, even if this
maps to what cp offers, particularly because there is already the --link
mode, and that the new option interacts with it. Here is an idea of
interface with an option named, named say, --transfer-method:
- "link", maps to the existing --link, which is just kept as a
deprecated alias.
- "clone" is the new mode you propose.
- "copy" is the default, and copies directly files. This automatic mode
also makes the implementation around transfer_relfile more difficult to
apprehend in my opinion, and I would think that all the different
transfer modes ought to be maintained within it. pg_upgrade.h also has
logic for such transfer modes.

After that, the implementation of cloneFile() looks logically correct to
me.
--
Michael

#28Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#27)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Fri, Sep 28, 2018 at 02:19:53PM +0900, Michael Paquier wrote:

Thanks for the rebase. At the end I got my hands on only an APFS using
a mac. I ran a test with an instance holding a database with pgbench at
scaling factor 500, which gives close to 6.5GB. The results are nice on
my laptop:
- --reflink=never runs in 15s
- --reflink=always runs in 4s
So that's a very nice gain!

Please note that I have moved the patch to CF 2018-11, as the last
review is very recent.
--
Michael

#29Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#27)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 28/09/2018 07:19, Michael Paquier wrote:

+    static bool     cloning_ok = true;
+
+    pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+           old_file, new_file);
+    if (cloning_ok &&
+        !cloneFile(old_file, new_file, map->nspname, map->relname, true))
+    {
+        pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n");
+        cloning_ok = false;
+        copyFile(old_file, new_file, map->nspname, map->relname);
+    }
+    else
+        copyFile(old_file, new_file, map->nspname, map->relname);

This part overlaps with the job that check_reflink() already does.
Wouldn't it be more simple to have check_reflink do a one-time check
with the automatic mode, enforcing to REFLINK_NEVER if cloning test did
not pass when REFLINK_AUTO is used? This would simplify
transfer_relfile().

I'll look into that.

The --help output of pg_upgrade has not been updated.

will fix

I am not a fan of the --reflink interface to be honest, even if this
maps to what cp offers, particularly because there is already the --link
mode, and that the new option interacts with it. Here is an idea of
interface with an option named, named say, --transfer-method:
- "link", maps to the existing --link, which is just kept as a
deprecated alias.
- "clone" is the new mode you propose.
- "copy" is the default, and copies directly files. This automatic mode
also makes the implementation around transfer_relfile more difficult to
apprehend in my opinion, and I would think that all the different
transfer modes ought to be maintained within it. pg_upgrade.h also has
logic for such transfer modes.

I can see the argument for that. But I don't understand where the
automatic mode fits into this. I would like to keep all three modes
from my patch: copy, clone-if-possible, clone-or-fail, unless you want
to argue against that.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#30Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#29)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Tue, Oct 02, 2018 at 02:31:35PM +0200, Peter Eisentraut wrote:

I can see the argument for that. But I don't understand where the
automatic mode fits into this. I would like to keep all three modes
from my patch: copy, clone-if-possible, clone-or-fail, unless you want
to argue against that.

I'd like to argue against that :)

There could be an argument for having an automatic more within this
scheme, still I am not really a fan of this. When somebody integrates
pg_upgrade within an upgrade framework, they would likely test if
cloning actually works, bumping immediately on a failure, no? I'd like
to think that copy should be the default, cloning being available as an
option. Cloning is not supported on many filesystems anyway..
--
Michael

#31Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#30)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 2018-Oct-03, Michael Paquier wrote:

There could be an argument for having an automatic more within this
scheme, still I am not really a fan of this. When somebody integrates
pg_upgrade within an upgrade framework, they would likely test if
cloning actually works, bumping immediately on a failure, no? I'd like
to think that copy should be the default, cloning being available as an
option. Cloning is not supported on many filesystems anyway..

I'm not clear what interface are you proposing. Maybe they would just
use the clone-or-fail mode, and note whether it fails? If that's not
it, please explain.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#32Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#31)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote:

I'm not clear what interface are you proposing. Maybe they would just
use the clone-or-fail mode, and note whether it fails? If that's not
it, please explain.

Okay. What I am proposing is to not have any kind of automatic mode to
keep the code simple, with a new option called --transfer-mode, able to
do three things:
- "link", which is the equivalent of the existing --link.
- "copy", the default and the current behavior, which copies files.
- "clone", the new mode proposed in this thread.
--
Michael

#33Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#32)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 2018-Oct-03, Michael Paquier wrote:

On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote:

I'm not clear what interface are you proposing. Maybe they would just
use the clone-or-fail mode, and note whether it fails? If that's not
it, please explain.

Okay. What I am proposing is to not have any kind of automatic mode to
keep the code simple, with a new option called --transfer-mode, able to
do three things:
- "link", which is the equivalent of the existing --link.
- "copy", the default and the current behavior, which copies files.
- "clone", the new mode proposed in this thread.

I see. Peter is proposing to have a fourth mode, essentially
--transfer-mode=clone-or-copy.

Thinking about a generic tool that wraps pg_upgrade (say, Debian's
wrapper for it) this makes sense: just use the fastest non-destructive
mode available. Which ones are available depends on what the underlying
filesystem is, so it's not up to the tool's writer to decide which to
use ahead of time.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#34Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#33)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Tue, Oct 02, 2018 at 11:07:06PM -0300, Alvaro Herrera wrote:

I see. Peter is proposing to have a fourth mode, essentially
--transfer-mode=clone-or-copy.

Yes, a mode which depends on what the file system supports. Perhaps
"safe" or "fast" could be another name, in the shape of the fastest
method available which does not destroy the existing cluster's data.

Thinking about a generic tool that wraps pg_upgrade (say, Debian's
wrapper for it) this makes sense: just use the fastest non-destructive
mode available. Which ones are available depends on what the underlying
filesystem is, so it's not up to the tool's writer to decide which to
use ahead of time.

This could have merit. Now, it seems to me that we have two separate
concepts here, which should be addressed separately:
1) cloning file mode, which is the new actual feature.
2) automatic mode, which is a subset of the copy mode and the new clone
mode. What I am not sure when it comes to that stuff is if we should
consider in some way the link mode as being part of this automatic
selection concept..
--
Michael

#35Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#33)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Tue, Oct 2, 2018 at 11:07:06PM -0300, Alvaro Herrera wrote:

On 2018-Oct-03, Michael Paquier wrote:

On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote:

I'm not clear what interface are you proposing. Maybe they would just
use the clone-or-fail mode, and note whether it fails? If that's not
it, please explain.

Okay. What I am proposing is to not have any kind of automatic mode to
keep the code simple, with a new option called --transfer-mode, able to
do three things:
- "link", which is the equivalent of the existing --link.
- "copy", the default and the current behavior, which copies files.
- "clone", the new mode proposed in this thread.

I see. Peter is proposing to have a fourth mode, essentially
--transfer-mode=clone-or-copy.

Uh, if you use --link, and the two data directories are on different
file systems, it fails. No one has ever asked for link-or-copy, so why
are we considering clone-or-copy? Are we going to need
link-or-clone-or-copy too? I do realize that clone and copy have
non-destructive behavior on the old cluster once started, so it does
make some sense to merge them, unlike link.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#36Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Bruce Momjian (#35)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 10/10/2018 21:50, Bruce Momjian wrote:

I see. Peter is proposing to have a fourth mode, essentially
--transfer-mode=clone-or-copy.

Uh, if you use --link, and the two data directories are on different
file systems, it fails. No one has ever asked for link-or-copy, so why
are we considering clone-or-copy? Are we going to need
link-or-clone-or-copy too? I do realize that clone and copy have
non-destructive behavior on the old cluster once started, so it does
make some sense to merge them, unlike link.

I'm OK to get rid of the clone-or-copy mode. I can see the confusion.

The original reason for this behavior was that the Linux implementation
used copy_file_range(), which does clone-or-copy without any way to
control it. But the latest patch doesn't use that anymore, so we don't
really need it, if it's controversial.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#37Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Peter Eisentraut (#36)
1 attachment(s)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 11/10/2018 16:50, Peter Eisentraut wrote:

On 10/10/2018 21:50, Bruce Momjian wrote:

I see. Peter is proposing to have a fourth mode, essentially
--transfer-mode=clone-or-copy.

Uh, if you use --link, and the two data directories are on different
file systems, it fails. No one has ever asked for link-or-copy, so why
are we considering clone-or-copy? Are we going to need
link-or-clone-or-copy too? I do realize that clone and copy have
non-destructive behavior on the old cluster once started, so it does
make some sense to merge them, unlike link.

I'm OK to get rid of the clone-or-copy mode. I can see the confusion.

New patch that removes all the various reflink modes and adds a new
option --clone that works similar to --link. I think it's much cleaner now.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v6-0001-pg_upgrade-Allow-use-of-file-cloning.patchtext/plain; charset=UTF-8; name=v6-0001-pg_upgrade-Allow-use-of-file-cloning.patch; x-mac-creator=0; x-mac-type=0Download
From 8b7285e61e425fad4d07a011efe2ccd7fd280d54 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Thu, 18 Oct 2018 23:09:59 +0200
Subject: [PATCH v6] pg_upgrade: Allow use of file cloning

Add another transfer mode --clone to pg_upgrade (besides the existing
--link and the default copy), using special file cloning calls.  This
makes the file transfer faster and more space efficient, achieving
speed similar to --link mode without the associated drawbacks.

On Linux, file cloning is supported on Btrfs and XFS (if formatted with
reflink support).  On macOS, file cloning is supported on APFS.
---
 configure                        |  2 +-
 configure.in                     |  1 +
 doc/src/sgml/ref/pgupgrade.sgml  | 37 ++++++++++---
 src/bin/pg_upgrade/check.c       | 13 ++++-
 src/bin/pg_upgrade/file.c        | 90 ++++++++++++++++++++++++++++++++
 src/bin/pg_upgrade/option.c      |  8 +++
 src/bin/pg_upgrade/pg_upgrade.h  |  6 ++-
 src/bin/pg_upgrade/relfilenode.c | 44 ++++++++++------
 src/include/pg_config.h.in       |  3 ++
 9 files changed, 179 insertions(+), 25 deletions(-)

diff --git a/configure b/configure
index 43ae8c869d..5a7aa02b27 100755
--- a/configure
+++ b/configure
@@ -15130,7 +15130,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 519ecd5e1e..7f22bcee75 100644
--- a/configure.in
+++ b/configure.in
@@ -1602,6 +1602,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 AC_CHECK_FUNCS(m4_normalize([
 	cbrt
 	clock_gettime
+	copyfile
 	fdatasync
 	getifaddrs
 	getpeerucred
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index d51146d641..e616ea9dbc 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -182,6 +182,25 @@ <title>Options</title>
       <listitem><para>display version information, then exit</para></listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--clone</option></term>
+      <listitem>
+       <para>
+        Use efficient file cloning (also known as <quote>reflinks</quote>)
+        instead of copying files to the new cluster.  This can result in
+        near-instantaneous copying of the data files, giving the speed
+        advantages of <option>-k</option>/<option>--link</option> while
+        leaving the old cluster untouched.
+       </para>
+
+       <para>
+        At present, reflinks are supported on Linux (kernel 4.5 or later) with
+        Btrfs and XFS (on file systems created with reflink support, which is
+        not the default for XFS at this writing), and on macOS with APFS.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>-?</option></term>
       <term><option>--help</option></term>
@@ -340,7 +359,7 @@ <title>Run <application>pg_upgrade</application></title>
      Always run the <application>pg_upgrade</application> binary of the new server, not the old one.
      <application>pg_upgrade</application> requires the specification of the old and new cluster's
      data and executable (<filename>bin</filename>) directories. You can also specify
-     user and port values, and whether you want the data files linked
+     user and port values, and whether you want the data files linked or cloned
      instead of the default copy behavior.
     </para>
 
@@ -351,8 +370,12 @@ <title>Run <application>pg_upgrade</application></title>
      once you start the new cluster after the upgrade.  Link mode also
      requires that the old and new cluster data directories be in the
      same file system.  (Tablespaces and <filename>pg_wal</filename> can be on
-     different file systems.)  See <literal>pg_upgrade --help</literal> for a full
-     list of options.
+     different file systems.)
+     The clone mode provides the same speed and disk space advantages but will
+     not leave the old cluster unusable after the upgrade.  The clone mode
+     also requires that the old and new data directories be in the same file
+     system.  The clone mode is only available on certain operating systems
+     and file systems.
     </para>
 
     <para>
@@ -388,8 +411,9 @@ <title>Run <application>pg_upgrade</application></title>
      to perform only the checks, even if the old server is still
      running. <command>pg_upgrade --check</command> will also outline any
      manual adjustments you will need to make after the upgrade.  If you
-     are going to be using link mode, you should use the <option>--link</option>
-     option with <option>--check</option> to enable link-mode-specific checks.
+     are going to be using link or clone mode, you should use the option
+     <option>--link</option> or <option>--clone</option> with
+     <option>--check</option> to enable mode-specific checks.
      <command>pg_upgrade</command> requires write permission in the current directory.
     </para>
 
@@ -722,7 +746,8 @@ <title>Notes</title>
 
   <para>
    If you want to use link mode and you do not want your old cluster
-   to be modified when the new cluster is started, make a copy of the
+   to be modified when the new cluster is started, consider using the clone mode.
+   If that is not available, make a copy of the
    old cluster and upgrade that in link mode. To make a valid copy
    of the old cluster, use <command>rsync</command> to create a dirty
    copy of the old cluster while the server is running, then shut down
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 5a78d603dc..555e5dcbba 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -149,8 +149,17 @@ check_new_cluster(void)
 
 	check_loadable_libraries();
 
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		check_hard_link();
+	switch (user_opts.transfer_mode)
+	{
+		case TRANSFER_MODE_CLONE:
+			check_file_clone();
+			break;
+		case TRANSFER_MODE_COPY:
+			break;
+		case TRANSFER_MODE_LINK:
+			check_hard_link();
+			break;
+	}
 
 	check_is_install_user(&new_cluster);
 
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index c27cc93dc2..244dd4d88b 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -18,6 +18,13 @@
 
 #include <sys/stat.h>
 #include <fcntl.h>
+#ifdef HAVE_COPYFILE
+#include <copyfile.h>
+#endif
+#ifdef __linux__
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#endif
 
 
 #ifdef WIN32
@@ -25,6 +32,47 @@ static int	win32_pghardlink(const char *src, const char *dst);
 #endif
 
 
+/*
+ * cloneFile()
+ *
+ * Clones/reflinks a relation file from src to dst.
+ *
+ * schemaName/relName are relation's SQL name (used for error messages only).
+ */
+void
+cloneFile(const char *src, const char *dst,
+		  const char *schemaName, const char *relName)
+{
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(src, dst, NULL, COPYFILE_CLONE_FORCE) < 0)
+		pg_fatal("error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+				 schemaName, relName, src, dst, strerror(errno));
+#elif defined(__linux__) && defined(FICLONE)
+	int			src_fd;
+	int			dest_fd;
+
+	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+		pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n",
+				 schemaName, relName, src, strerror(errno));
+
+	if ((dest_fd = open(dst, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+						pg_file_create_mode)) < 0)
+		pg_fatal("error while cloning relation \"%s.%s\": could not create file \"%s\": %s\n",
+				 schemaName, relName, dst, strerror(errno));
+
+	if (ioctl(dest_fd, FICLONE, src_fd) < 0)
+	{
+		unlink(dst);
+		pg_fatal("error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
+				 schemaName, relName, src, dst, strerror(errno));
+	}
+
+	close(src_fd);
+	close(dest_fd);
+#endif
+}
+
+
 /*
  * copyFile()
  *
@@ -270,6 +318,48 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile,
 	close(src_fd);
 }
 
+void
+check_file_clone(void)
+{
+	char		existing_file[MAXPGPATH];
+	char		new_link_file[MAXPGPATH];
+
+	snprintf(existing_file, sizeof(existing_file), "%s/PG_VERSION", old_cluster.pgdata);
+	snprintf(new_link_file, sizeof(new_link_file), "%s/PG_VERSION.clonetest", new_cluster.pgdata);
+	unlink(new_link_file);		/* might fail */
+
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+	if (copyfile(existing_file, new_link_file, NULL, COPYFILE_CLONE_FORCE) < 0)
+		pg_fatal("could not clone file between old and new data directories: %s\n",
+				 strerror(errno));
+#elif defined(__linux__) && defined(FICLONE)
+	{
+		int			src_fd;
+		int			dest_fd;
+
+		if ((src_fd = open(existing_file, O_RDONLY | PG_BINARY, 0)) < 0)
+			pg_fatal("could not open file \"%s\": %s\n",
+					 existing_file, strerror(errno));
+
+		if ((dest_fd = open(new_link_file, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							pg_file_create_mode)) < 0)
+			pg_fatal("could not create file \"%s\": %s\n",
+					 new_link_file, strerror(errno));
+
+		if (ioctl(dest_fd, FICLONE, src_fd) < 0)
+			pg_fatal("could not clone file between old and new data directories: %s\n",
+					 strerror(errno));
+
+		close(src_fd);
+		close(dest_fd);
+	}
+#else
+	pg_fatal("file cloning not supported on this platform\n");
+#endif
+
+	unlink(new_link_file);
+}
+
 void
 check_hard_link(void)
 {
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 9dbc9225a6..eb3b9f603d 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -53,6 +53,9 @@ parseCommandLine(int argc, char *argv[])
 		{"retain", no_argument, NULL, 'r'},
 		{"jobs", required_argument, NULL, 'j'},
 		{"verbose", no_argument, NULL, 'v'},
+
+		{"clone", no_argument, NULL, 1},
+
 		{NULL, 0, NULL, 0}
 	};
 	int			option;			/* Command line option */
@@ -203,6 +206,10 @@ parseCommandLine(int argc, char *argv[])
 				log_opts.verbose = true;
 				break;
 
+			case 1:
+				user_opts.transfer_mode = TRANSFER_MODE_CLONE;
+				break;
+
 			default:
 				pg_fatal("Try \"%s --help\" for more information.\n",
 						 os_info.progname);
@@ -293,6 +300,7 @@ usage(void)
 	printf(_("  -U, --username=NAME           cluster superuser (default \"%s\")\n"), os_info.user);
 	printf(_("  -v, --verbose                 enable verbose internal logging\n"));
 	printf(_("  -V, --version                 display version information, then exit\n"));
+	printf(_("  --clone                       clone instead of copying files to new cluster\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
 			 "Before running pg_upgrade you must:\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index f83a3eeb67..51bd211d46 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -230,10 +230,11 @@ typedef struct
 } ControlData;
 
 /*
- * Enumeration to denote link modes
+ * Enumeration to denote transfer modes
  */
 typedef enum
 {
+	TRANSFER_MODE_CLONE,
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_LINK
 } transferMode;
@@ -372,12 +373,15 @@ bool		pid_lock_file_exists(const char *datadir);
 
 /* file.c */
 
+void cloneFile(const char *src, const char *dst,
+		 const char *schemaName, const char *relName);
 void copyFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName);
 void linkFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName);
 void rewriteVisibilityMap(const char *fromfile, const char *tofile,
 					 const char *schemaName, const char *relName);
+void		check_file_clone(void);
 void		check_hard_link(void);
 
 /* fopen_priv() is no longer different from fopen() */
diff --git a/src/bin/pg_upgrade/relfilenode.c b/src/bin/pg_upgrade/relfilenode.c
index ed604f26ca..3b16c92a02 100644
--- a/src/bin/pg_upgrade/relfilenode.c
+++ b/src/bin/pg_upgrade/relfilenode.c
@@ -30,10 +30,18 @@ void
 transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 							 char *old_pgdata, char *new_pgdata)
 {
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		pg_log(PG_REPORT, "Linking user relation files\n");
-	else
-		pg_log(PG_REPORT, "Copying user relation files\n");
+	switch (user_opts.transfer_mode)
+	{
+		case TRANSFER_MODE_CLONE:
+			pg_log(PG_REPORT, "Cloning user relation files\n");
+			break;
+		case TRANSFER_MODE_COPY:
+			pg_log(PG_REPORT, "Copying user relation files\n");
+			break;
+		case TRANSFER_MODE_LINK:
+			pg_log(PG_REPORT, "Linking user relation files\n");
+			break;
+	}
 
 	/*
 	 * Transferring files by tablespace is tricky because a single database
@@ -250,17 +258,23 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 				   old_file, new_file);
 			rewriteVisibilityMap(old_file, new_file, map->nspname, map->relname);
 		}
-		else if (user_opts.transfer_mode == TRANSFER_MODE_COPY)
-		{
-			pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
-				   old_file, new_file);
-			copyFile(old_file, new_file, map->nspname, map->relname);
-		}
 		else
-		{
-			pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n",
-				   old_file, new_file);
-			linkFile(old_file, new_file, map->nspname, map->relname);
-		}
+			switch (user_opts.transfer_mode)
+			{
+				case TRANSFER_MODE_CLONE:
+					pg_log(PG_VERBOSE, "cloning \"%s\" to \"%s\"\n",
+						   old_file, new_file);
+					cloneFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_COPY:
+					pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+						   old_file, new_file);
+					copyFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_LINK:
+					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"\n",
+						   old_file, new_file);
+					linkFile(old_file, new_file, map->nspname, map->relname);
+			}
 	}
 }
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9798bd24b4..d2a356a356 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -114,6 +114,9 @@
 /* Define to 1 if your compiler handles computed gotos. */
 #undef HAVE_COMPUTED_GOTO
 
+/* Define to 1 if you have the `copyfile' function. */
+#undef HAVE_COPYFILE
+
 /* Define to 1 if you have the <crtdefs.h> header file. */
 #undef HAVE_CRTDEFS_H
 

base-commit: e74dd00f53cd6dc1887f76b9672e5f6dcf0fd8a2
-- 
2.19.1

#38Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#37)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote:

New patch that removes all the various reflink modes and adds a new
option --clone that works similar to --link. I think it's much cleaner now.

Thanks Peter for the new version.

+
+		{"clone", no_argument, NULL, 1},
+
 		{NULL, 0, NULL, 0}

Why newlines here?

@@ -293,6 +300,7 @@ usage(void)
 	printf(_("  -U, --username=NAME           cluster superuser (default \"%s\")\n"), os_info.user);
 	printf(_("  -v, --verbose                 enable verbose internal logging\n"));
 	printf(_("  -V, --version                 display version information, then exit\n"));
+	printf(_("  --clone                       clone instead of copying files to new cluster\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"

An idea for a one-letter option could be -n. This patch can live
without.

+     pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n",
+              schemaName, relName, src, strerror(errno));

The tail of those error messages "could not open file" and "could not
create file" are already available as translatable error messages.
Would it be better to split those messages in two for translators? One
is generated with pg_fatal("error while cloning relation \"%s.%s\":
could not open file \"%s\": %s\n",
+ schemaName, relName, src, strerror(errno));

The tail of those error messages "could not open file" and "could not
create file" are already available as translatable error messages.
Would it be better to split those messages in two for translators? One
is generated with PG_VERBOSE to mention that cloning checks are
done, and the second one is fatal with the actual errors.

Those are all minor points, the patch looks good to me.
--
Michael

#39Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#38)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 19/10/2018 01:50, Michael Paquier wrote:

On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote:

New patch that removes all the various reflink modes and adds a new
option --clone that works similar to --link. I think it's much cleaner now.

Thanks Peter for the new version.

+
+		{"clone", no_argument, NULL, 1},
+
{NULL, 0, NULL, 0}

Why newlines here?

fixed

@@ -293,6 +300,7 @@ usage(void)
printf(_("  -U, --username=NAME           cluster superuser (default \"%s\")\n"), os_info.user);
printf(_("  -v, --verbose                 enable verbose internal logging\n"));
printf(_("  -V, --version                 display version information, then exit\n"));
+	printf(_("  --clone                       clone instead of copying files to new cluster\n"));
printf(_("  -?, --help                    show this help, then exit\n"));
printf(_("\n"

An idea for a one-letter option could be -n. This patch can live
without.

-n is often used for something like "dry run", so it didn't go for that
here. I suspect the cloning will remain a marginal option for some
time, so having only a long option is acceptable.

+     pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n",
+              schemaName, relName, src, strerror(errno));

The tail of those error messages "could not open file" and "could not
create file" are already available as translatable error messages.
Would it be better to split those messages in two for translators? One
is generated with pg_fatal("error while cloning relation \"%s.%s\":
could not open file \"%s\": %s\n",
+ schemaName, relName, src, strerror(errno));

I think this is too complicated for a few messages.

Those are all minor points, the patch looks good to me.

Committed, thanks.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#39)
Re: file cloning in pg_upgrade and CREATE DATABASE

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

Committed, thanks.

It seems unfortunate that there is no test coverage in the committed
patch. I realize that it may be hard to test given the filesystem
dependency, but how will we know if it works at all?

regards, tom lane

#41Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#39)
Re: file cloning in pg_upgrade and CREATE DATABASE

On Wed, Nov 07, 2018 at 06:42:00PM +0100, Peter Eisentraut wrote:

-n is often used for something like "dry run", so it didn't go for that
here. I suspect the cloning will remain a marginal option for some
time, so having only a long option is acceptable.

Good point. I didn't consider this, and it is true that --dry-run is
used, pg_rewind being one example.
--
Michael

#42Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Tom Lane (#40)
Re: file cloning in pg_upgrade and CREATE DATABASE

On 07/11/2018 19:03, Tom Lane wrote:

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

Committed, thanks.

It seems unfortunate that there is no test coverage in the committed
patch. I realize that it may be hard to test given the filesystem
dependency, but how will we know if it works at all?

You can use something like

PG_UPGRADE_OPTS=--clone make -C src/bin/pg_upgrade check

(--link is similarly untested.)

If we do get the TAP tests for pg_upgrade set up, we can create smaller
and faster test cases for this.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services