Btrfs clone WIP patch
This patch against PostgreSQL 9.1.8 takes advantage of efficient file
cloning on Linux Btrfs file systems to make CREATE DATABASE operations
extremely fast regardless of the size of the database used as a
template. On my system, I can create a database from a multi-gibibyte
template in a second or less. This is very useful for automated testing
as well as in a development environment where reverting to a baseline
database is frequently required. As an added bonus, newly created
databases require very little additional disk storage until they diverge
from the template.
The efficient cloning is accomplished by a Btrfs-specific ioctl() call.
On non-Linux systems or if the ioctl() call fails, file contents are
copied in the conventional way, so no configuration is needed. This has
been tested on a Linux system on both Btrfs and XFS file systems, as well
as on an OS X system.
The clone_file() function was originally copied from GNU coreutils, which
is under GPL v3. The function is currently only about ten lines long and
contains little essential information beyond the magic values needed for
the ioctl() call, so I'm not sure whether the license is a problem.
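For anyone who wants to try the ioctl() in isolation first, here is a
minimal standalone sketch (not part of the patch) that clones its first
argument to its second. It only succeeds when both files are on the same
Btrfs file system; otherwise the ioctl() fails and an error is reported:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_CLONE _IOW (BTRFS_IOCTL_MAGIC, 9, int)

int
main(int argc, char **argv)
{
	int		src_fd,
			dest_fd;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s SRC DEST\n", argv[0]);
		return 1;
	}
	src_fd = open(argv[1], O_RDONLY);
	dest_fd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src_fd < 0 || dest_fd < 0 ||
		ioctl(dest_fd, BTRFS_IOC_CLONE, src_fd) != 0)
	{
		perror("clone");
		return 1;
	}
	close(src_fd);
	close(dest_fd);
	return 0;
}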
--
Jonathan Ross Rogers
Attachments:
btrfs_clone.patch (text/x-patch)
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 6cfb816..719a5c1 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,6 +22,10 @@
#include <unistd.h>
#include <sys/stat.h>
+#ifdef HAVE_SYS_IOCTL_H
+#include <sys/ioctl.h>
+#endif
+
#include "storage/copydir.h"
#include "storage/fd.h"
#include "miscadmin.h"
@@ -139,6 +143,24 @@ copydir(char *fromdir, char *todir, bool recurse)
}
/*
+ * Perform the O(1) btrfs clone operation, if possible.
+ * Upon success, return 0. Otherwise, return -1.
+ */
+static inline int
+clone_file (int dest_fd, int src_fd)
+{
+#ifdef __linux__
+# define BTRFS_IOCTL_MAGIC 0x94
+# define BTRFS_IOC_CLONE _IOW (BTRFS_IOCTL_MAGIC, 9, int)
+ return ioctl (dest_fd, BTRFS_IOC_CLONE, src_fd);
+#else
+ (void) dest_fd;
+ (void) src_fd;
+ return -1;
+#endif
+}
+
+/*
* copy one file
*/
void
@@ -150,11 +172,6 @@ copy_file(char *fromfile, char *tofile)
int nbytes;
off_t offset;
- /* Use palloc to ensure we get a maxaligned buffer */
-#define COPY_BUF_SIZE (8 * BLCKSZ)
-
- buffer = palloc(COPY_BUF_SIZE);
-
/*
* Open the files
*/
@@ -171,38 +188,54 @@ copy_file(char *fromfile, char *tofile)
(errcode_for_file_access(),
errmsg("could not create file \"%s\": %m", tofile)));
- /*
- * Do the data copying.
- */
- for (offset = 0;; offset += nbytes)
+ if (clone_file (dstfd, srcfd) == 0)
+ ereport(DEBUG1, (errmsg("Cloned \"%s\" to \"%s\".", fromfile, tofile)));
+
+ else
{
- /* If we got a cancel signal during the copy of the file, quit */
- CHECK_FOR_INTERRUPTS();
+ /*
+ * Do the data copying.
+ */
- nbytes = read(srcfd, buffer, COPY_BUF_SIZE);
- if (nbytes < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not read file \"%s\": %m", fromfile)));
- if (nbytes == 0)
- break;
- errno = 0;
- if ((int) write(dstfd, buffer, nbytes) != nbytes)
+ /* Use palloc to ensure we get a maxaligned buffer */
+#define COPY_BUF_SIZE (8 * BLCKSZ)
+
+ buffer = palloc(COPY_BUF_SIZE);
+
+ ereport(DEBUG1, (errmsg("Copying \"%s\" to \"%s\" in userspace.",
+ fromfile, tofile)));
+ for (offset = 0;; offset += nbytes)
{
- /* if write didn't set errno, assume problem is no disk space */
- if (errno == 0)
- errno = ENOSPC;
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not write to file \"%s\": %m", tofile)));
+ /* If we got a cancel signal during the copy of the file, quit */
+ CHECK_FOR_INTERRUPTS();
+
+ nbytes = read(srcfd, buffer, COPY_BUF_SIZE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m", fromfile)));
+ if (nbytes == 0)
+ break;
+ errno = 0;
+ if ((int) write(dstfd, buffer, nbytes) != nbytes)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to file \"%s\": %m", tofile)));
+ }
+
+ /*
+ * We fsync the files later but first flush them to avoid spamming the
+ * cache and hopefully get the kernel to start writing them out before
+ * the fsync comes.
+ */
+ pg_flush_data(dstfd, offset, nbytes);
}
- /*
- * We fsync the files later but first flush them to avoid spamming the
- * cache and hopefully get the kernel to start writing them out before
- * the fsync comes.
- */
- pg_flush_data(dstfd, offset, nbytes);
+ pfree(buffer);
}
if (close(dstfd))
@@ -212,7 +245,6 @@ copy_file(char *fromfile, char *tofile)
close(srcfd);
- pfree(buffer);
}
Jonathan Rogers <jrogers@socialserve.com> writes:
This patch against PostgreSQL 9.1.8 takes advantage of efficient file
cloning on Linux Btrfs file systems to make CREATE DATABASE operations
extremely fast regardless of the size of the database used as a
template.
It would be easier to review this patch if the bulk of it weren't simple
reindentation of existing code. (Or at least it ought to be that ---
I object to your having moved the buffer palloc inside the loop. A
patch that is trying to optimize a minority case can expect to be
rejected if it makes things worse for everyone else.)
Consider whether you can't phrase the patch to avoid that, perhaps by
use of "continue" instead of an else-block. Alternatively, enclose the
existing code in braces but don't reindent it, ie,
+ if (whatever)
+ ... new code ...
+ else
+ {
... existing code ...
+ }
The next pgindent run will fix the funny indentation, or the committer
can do it if he wishes after reviewing.
The efficient cloning is accomplished by a Btrfs-specific ioctl() call.
The big-picture question of course is whether we want to carry and
maintain a filesystem-specific hack. I don't have a sense that btrfs
is so widely used as to justify this.
+#ifdef __linux__ +# define BTRFS_IOCTL_MAGIC 0x94 +# define BTRFS_IOC_CLONE _IOW (BTRFS_IOCTL_MAGIC, 9, int) + return ioctl (dest_fd, BTRFS_IOC_CLONE, src_fd); +#else
This seems to me to be unacceptable on its face. If we can't get these
constants out of a system header file, it's unlikely that the feature is
stable enough to depend on, if indeed it's meant for general-purpose use
at all. We could easily end up invoking unexpected behaviors.
regards, tom lane
Tom Lane wrote:
Jonathan Rogers <jrogers@socialserve.com> writes:
This patch against PostgreSQL 9.1.8 takes advantage of efficient file
cloning on Linux Btrfs file systems to make CREATE DATABASE operations
extremely fast regardless of the size of the database used as a
template.

It would be easier to review this patch if the bulk of it weren't simple
reindentation of existing code. (Or at least it ought to be that ---
I object to your having moved the buffer palloc inside the loop. A
patch that is trying to optimize a minority case can expect to be
rejected if it makes things worse for everyone else.)
The buffer allocation is actually not inside the loop, but inside the if
branch for ordinary copying behavior since the buffer is unnecessary in
the case of a successful clone.
Consider whether you can't phrase the patch to avoid that, perhaps by
use of "continue" instead of an else-block. Alternatively, enclose the
existing code in braces but don't reindent it, ie,

+ if (whatever)
+ ... new code ...
+ else
+ {
... existing code ...
+ }
Indeed, I was bothered by the need to reindent so much as well. I'll see
if I can do better.
The next pgindent run will fix the funny indentation, or the committer
can do it if he wishes after reviewing.

The efficient cloning is accomplished by a Btrfs-specific ioctl() call.
The big-picture question of course is whether we want to carry and
maintain a filesystem-specific hack. I don't have a sense that btrfs
is so widely used as to justify this.
Yes, this is a problem I considered. I think the basic problem is the
lack of any kind of generic interface to copy or clone a file. A system
call for Linux to copy or clone has been proposed more than once but so
far, nothing has been accepted. I believe there are a few file systems
that support some kind of efficient cloning, but I haven't investigated
it deeply.
+#ifdef __linux__
+# define BTRFS_IOCTL_MAGIC 0x94
+# define BTRFS_IOC_CLONE _IOW (BTRFS_IOCTL_MAGIC, 9, int)
+ return ioctl (dest_fd, BTRFS_IOC_CLONE, src_fd);
+#else

This seems to me to be unacceptable on its face. If we can't get these
constants out of a system header file, it's unlikely that the feature is
stable enough to depend on, if indeed it's meant for general-purpose use
at all. We could easily end up invoking unexpected behaviors.
Of course you're right that defining values right there is no good. It
looks like the values are in the Linux headers since 2.6.32 when Btrfs
was merged into mainline. I guess I'll need to brush up on CPP to figure
out how to use the Linux header values if they exist.
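Something along these lines might do, where HAVE_LINUX_BTRFS_H would be
a new configure-detected symbol; I still need to verify which kernel
headers actually export BTRFS_IOC_CLONE before relying on this:

#ifdef HAVE_LINUX_BTRFS_H
#include <linux/btrfs.h>		/* expected to define BTRFS_IOC_CLONE */
#endif

#ifndef BTRFS_IOC_CLONE
/* fall back to the literal values GNU cp uses */
#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_CLONE _IOW (BTRFS_IOCTL_MAGIC, 9, int)
#endif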
Would it be better to move clone_file() into its own module where
implementations for other file system types might eventually be added?
My first implementation called cp with the "--reflink=auto" option since
that seems to be the closest thing to a file system agnostic interface.
The above snippet comes directly from the GNU cp source and I'm not sure
why that code defines the values instead of taking them from Linux headers.
--
Jonathan Ross Rogers
On 02/13/2013 02:13 PM, Tom Lane wrote:
The big-picture question of course is whether we want to carry and
maintain a filesystem-specific hack. I don't have a sense that btrfs
is so widely used as to justify this.
If this is a valuable hack, it seems like it could work on ZFS as well.
If we could make it for any snapshot-capable filesystem, and not just
BTRFS, then it would make more sense.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus wrote:
On 02/13/2013 02:13 PM, Tom Lane wrote:
The big-picture question of course is whether we want to carry and
maintain a filesystem-specific hack. I don't have a sense that btrfs
is so widely used as to justify this.

If this is a valuable hack, it seems like it could work on ZFS as well.
If we could make it for any snapshot-capable filesystem, and not just
BTRFS, then it would make more sense.
Yes, that's exactly what I hope will be possible. I need to investigate
ZFS to see if it's feasible.
--
Jonathan Ross Rogers
On Wed, Feb 13, 2013 at 5:48 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 02/13/2013 02:13 PM, Tom Lane wrote:
The big-picture question of course is whether we want to carry and
maintain a filesystem-specific hack. I don't have a sense that btrfs
is so widely used as to justify this.

If this is a valuable hack, it seems like it could work on ZFS as well.
If we could make it for any snapshot-capable filesystem, and not just
BTRFS, then it would make more sense.
I was thinking that too, but I think this is a file level clone, not a
whole filesystem. As far as I can tell, you can't clone individual
files in ZFS.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Jonathan Rogers <jrogers@socialserve.com> writes:
Would it be better to move clone_file() into its own module where
implementations for other file system types might eventually be added?
Yeah, possibly. I considered suggesting that the current code be
treated as a fallback implementation of clone_file, but I'm not sure
if there's a convenient way to manage the run-time fallback if we try
to do it like that. In any case, +1 for leaving the door open for
easy addition of other cloning techniques.
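For illustration only, a table of techniques might look something like
this, with copy_file() doing the conventional copy whenever no clone
method succeeds:

typedef int (*clone_fn) (int dest_fd, int src_fd);

#ifdef __linux__
static int
btrfs_clone_file(int dest_fd, int src_fd)	/* the patch's clone_file() */
{
#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_CLONE _IOW (BTRFS_IOCTL_MAGIC, 9, int)
	return ioctl(dest_fd, BTRFS_IOC_CLONE, src_fd);
}
#endif

static const clone_fn clone_methods[] = {
#ifdef __linux__
	btrfs_clone_file,			/* other file systems could be added here */
#endif
	NULL
};

static bool
try_clone_file(int dest_fd, int src_fd)
{
	const clone_fn *fn;

	for (fn = clone_methods; *fn != NULL; fn++)
	{
		if ((*fn) (dest_fd, src_fd) == 0)
			return true;
	}
	return false;				/* caller falls back to the userspace copy */
}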
regards, tom lane
Phil Sorber wrote:
On Wed, Feb 13, 2013 at 5:48 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 02/13/2013 02:13 PM, Tom Lane wrote:
The big-picture question of course is whether we want to carry and
maintain a filesystem-specific hack. I don't have a sense that btrfs
is so widely used as to justify this.

If this is a valuable hack, it seems like it could work on ZFS as well.
If we could make it for any snapshot-capable filesystem, and not just
BTRFS, then it would make more sense.

I was thinking that too, but I think this is a file level clone, not a
whole filesystem. As far as I can tell, you can't clone individual
files in ZFS.
I've been thinking about both of these issues and decided to try a
different approach. This patch adds GUC options for two external
commands: one to copy a directory and one to delete a directory. This
allows filesystem-specific tools to be used to accomplish the efficient
cloning without Postgres having to know any details.
This works particularly well for Btrfs. On a GNU/Linux system, one can
simply configure the external copy command as "/bin/cp -r
--reflink=auto" and efficient cloning will be done on file systems that
support it and ordinary copying will be done otherwise. The directory
deletion command isn't needed and no special Postgres setup is required
other than putting the data directory on a Btrfs file system.
I have just been experimenting with ZFS and it does not seem to have any
capability or interface for cloning ordinary files or directories, so the
configuration is not as straightforward. However, I was able to set up a
Postgres cluster as a hierarchy of ZFS file systems in the same pool
with each directory under "base" being a separate file system and
configure Postgres to call shell scripts which call zfs snapshot and
clone commands to do the cloning and deleting.
In either case, the directories are copied recursively while the
Postgres internal copydir function does not recurse. I don't think that
should be a problem since there shouldn't be nested directories in the
first place.
--
Jonathan Ross Rogers
Attachments:
external_copy.patch (text/x-patch)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 37bd0a4..5b87eb1 100644
*** a/src/backend/commands/dbcommands.c
--- b/src/backend/commands/dbcommands.c
***************
*** 23,28 ****
--- 23,29 ----
#include <locale.h>
#include <unistd.h>
#include <sys/stat.h>
+ #include <sys/wait.h>
#include "access/genam.h"
#include "access/heapam.h"
***************
*** 44,49 ****
--- 45,51 ----
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+ #include "postmaster/fork_process.h"
#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
***************
*** 608,614 ****
*
* We don't need to copy subdirectories
*/
! copydir(srcpath, dstpath, false);
/* Record the filesystem change in XLOG */
{
--- 610,619 ----
*
* We don't need to copy subdirectories
*/
! if (external_copy_command)
! external_copydir(srcpath, dstpath);
! else
! copydir(srcpath, dstpath, false);
/* Record the filesystem change in XLOG */
{
***************
*** 1702,1707 ****
--- 1707,1767 ----
return result;
}
+
+ #define MAX_RM_TABLESPACE_WORDS 10
+ char *rm_tablespace_dir_command = NULL;
+
+
+ /* Remove a single tablespace directory by calling an external command */
+ void
+ external_rm_tablespace_dir(char *dir)
+ {
+ char *cmd_path;
+ /* leave space for directory names and terminator */
+ char *argv[MAX_RM_TABLESPACE_WORDS + 3];
+ int argc = 0;
+ pid_t pid;
+ int status;
+
+ char *sc = strdup(rm_tablespace_dir_command);
+ char *token;
+
+ Assert(sc);
+ token = strtok(sc, " ");
+ while (token)
+ {
+ ereport(DEBUG1, (errmsg("Appending \"%s\"", token)));
+ argv[argc++] = token;
+ Assert(argc <= MAX_RM_TABLESPACE_WORDS);
+ token = strtok(NULL, " ");
+ }
+ Assert(argc > 0);
+ argv[argc++] = dir;
+ cmd_path = argv[0];
+ ereport(DEBUG1, (errmsg("Calling %s to delete \"%s\"", cmd_path, dir)));
+ argv[argc] = NULL;
+ Assert(argv[argc] == NULL);
+
+ /* Fire off execv in child */
+ if ((pid = fork_process()) == 0)
+ {
+ if (execv(cmd_path, argv) < 0)
+ {
+ ereport(LOG,
+ (errmsg("could not execute \"%s\"", cmd_path)));
+ /* We're already in the child process here, can't return */
+ exit(1);
+ }
+ }
+ else {
+ waitpid(pid, &status, 0);
+ if (status) {
+ ereport(ERROR, (errmsg("could not delete dir \"%s\"", dir)));
+ }
+ }
+ free(sc);
+ }
+
/*
* Remove tablespace directories
*
***************
*** 1747,1753 ****
continue;
}
! if (!rmtree(dstpath, true))
ereport(WARNING,
(errmsg("some useless files may be left behind in old database directory \"%s\"",
dstpath)));
--- 1807,1815 ----
continue;
}
! if (rm_tablespace_dir_command)
! external_rm_tablespace_dir(dstpath);
! else if (!rmtree(dstpath, true))
ereport(WARNING,
(errmsg("some useless files may be left behind in old database directory \"%s\"",
dstpath)));
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 6cfb816..f6e579f 100644
*** a/src/backend/storage/file/copydir.c
--- b/src/backend/storage/file/copydir.c
***************
*** 21,31 ****
--- 21,34 ----
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
+ #include <sys/wait.h>
#include "storage/copydir.h"
#include "storage/fd.h"
#include "miscadmin.h"
+ #include "postmaster/fork_process.h"
+
/*
* On Windows, call non-macro versions of palloc; we can't reference
* CurrentMemoryContext in this file because of PGDLLIMPORT conflict.
***************
*** 40,45 ****
--- 43,106 ----
static void fsync_fname(char *fname, bool isdir);
+ #define MAX_COPY_COMMAND_WORDS 10
+ char *external_copy_command = NULL;
+
+
+ /*
+ * copy directory using external command
+ */
+ void
+ external_copydir(char *fromdir, char *todir)
+ {
+ char *cp_path;
+ /* leave space for directory names and terminator */
+ char *argv[MAX_COPY_COMMAND_WORDS + 3];
+ int argc = 0;
+ pid_t pid;
+ int status;
+
+ char *sc = strdup(external_copy_command);
+ char *token;
+
+ Assert(sc);
+ token = strtok(sc, " ");
+ while (token)
+ {
+ ereport(DEBUG1, (errmsg("Appending \"%s\"", token)));
+ argv[argc++] = token;
+ Assert(argc <= MAX_COPY_COMMAND_WORDS);
+ token = strtok(NULL, " ");
+ }
+ Assert(argc > 0);
+ argv[argc++] = fromdir;
+ argv[argc++] = todir;
+ cp_path = argv[0];
+ ereport(DEBUG1, (errmsg("Calling %s to copy dir \"%s\" to \"%s\"",
+ cp_path, fromdir, todir)));
+ argv[argc] = NULL;
+ Assert(argv[argc] == NULL);
+
+ /* Fire off execv in child */
+ if ((pid = fork_process()) == 0)
+ {
+ if (execv(cp_path, argv) < 0)
+ {
+ ereport(LOG,
+ (errmsg("could not execute \"%s\"", cp_path)));
+ /* We're already in the child process here, can't return */
+ exit(1);
+ }
+ }
+ else {
+ waitpid(pid, &status, 0);
+ if (status) {
+ ereport(ERROR, (errmsg("could not copy dir \"%s\" \"%s\"",
+ fromdir, todir)));
+ }
+ }
+ free(sc);
+ }
/*
* copydir: copy a directory
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a363441..0203cc2 100644
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 32,37 ****
--- 32,38 ----
#include "access/xact.h"
#include "catalog/namespace.h"
#include "commands/async.h"
+ #include "commands/dbcommands.h"
#include "commands/prepare.h"
#include "commands/vacuum.h"
#include "commands/variable.h"
***************
*** 59,64 ****
--- 60,66 ----
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
+ #include "storage/copydir.h"
#include "storage/standby.h"
#include "storage/fd.h"
#include "storage/predicate.h"
***************
*** 607,612 ****
--- 609,616 ----
gettext_noop("Customized Options"),
/* DEVELOPER_OPTIONS */
gettext_noop("Developer Options"),
+ /* FILE_OPERATION_OPTIONS */
+ gettext_noop("File Operation Options"),
/* help_config wants this array to be null-terminated */
NULL
};
***************
*** 2568,2573 ****
--- 2572,2599 ----
},
{
+ {"external_copy_command", PGC_BACKEND, DEVELOPER_OPTIONS,
+ gettext_noop("Sets external command to copy a file."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &external_copy_command,
+ NULL,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"rm_tablespace_dir_command", PGC_BACKEND, DEVELOPER_OPTIONS,
+ gettext_noop("Sets external command to remove tablespace directory."),
+ NULL,
+ GUC_NOT_IN_SAMPLE
+ },
+ &rm_tablespace_dir_command,
+ NULL,
+ NULL, NULL, NULL
+ },
+
+ {
{"default_tablespace", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Sets the default tablespace to create tables and indexes in."),
gettext_noop("An empty string selects the database's default tablespace."),
diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 21dacff..199c53f 100644
*** a/src/include/commands/dbcommands.h
--- b/src/include/commands/dbcommands.h
***************
*** 66,70 ****
--- 66,71 ----
extern void dbase_desc(StringInfo buf, uint8 xl_info, char *rec);
extern void check_encoding_locale_matches(int encoding, const char *collate, const char *ctype);
+ extern char *rm_tablespace_dir_command;
#endif /* DBCOMMANDS_H */
diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index 7e9f1f4..31edf97 100644
*** a/src/include/storage/copydir.h
--- b/src/include/storage/copydir.h
***************
*** 15,19 ****
--- 15,22 ----
extern void copydir(char *fromdir, char *todir, bool recurse);
extern void copy_file(char *fromfile, char *tofile);
+ extern void external_copydir(char *fromdir, char *todir);
+
+ extern char *external_copy_command;
#endif /* COPYDIR_H */
On 3/1/13 1:40 AM, Jonathan Rogers wrote:
I've been thinking about both of these issues and decided to try a
different approach. This patch adds GUC options for two external
commands
This is a reasonable approach for a proof of concept patch. I like the
idea you're playing with here, as a useful boost for both production
deployments and development synchronization jobs. There are some major
coding hurdles for this idea to go beyond proof of concept into
committed feature though, so I'm just going to dump all the obvious ones
I can see on you too. Feel free to ignore those and focus on the fun
usability / capability parts for a while first. Just want you to
realize what work is in your way, but not visible to you necessarily.
With so many GUCs already, adding more of them to support something with
a fairly narrow user base will be hard to justify committing. This sort
of thing seems like you could load it as an extension instead. I don't
think this is worth doing yet. For now the GUC approach is fine, and it
keeps your code sample small.
I'm going to start with how to continue validating this idea is useful
without worrying about those, to get some more testing/benchmarking
samples, and for all of that job what you're doing so far is fine. The
road toward something to commit is going to take a lot more complicated
code eventually though. There's no big conceptual leap involved; it's
just that where the changes need to go, and how much they need to touch
to do this accurately in all cases, isn't as simple as your demo. Handling
unusual situations and race conditions turns out to be almost all the
code in what you're replacing, and you'll need similar work in the
replacements.
I have just been experimenting with ZFS and it does not seem to have any
capability or interface for cloning ordinary files or directories so the
configuration is not as straightforward. However, I was able to set up a
Postgres cluster as a hierarchy of ZFS file systems in the same pool
with each directory under "base" being a separate file system and
configure Postgres to call shell scripts which call zfs snapshot and
clone commands to do the cloning and deleting.
To make things easier for a reviewer, it's useful to include working
examples like this as part of your submission. For this one that would
include:
-A sample script that created a test pool
-Working postgresql.conf changes compatible with ZFS
-Test program showing how to exercise the feature
I think I can see how to construct such an example for the btrfs
version, but having you show that explicitly (preferably with a whole
sample session executing it) will also help reviewers. Remember: if
you want to get your submission off to a good start, the reviewer should
be able to run your sample test, see the code work, and do something fun
within a few seconds of compiling it. Make that easy for them, and your
reviewer will start with a good impression of you and a positive outlook
for the change.
Now onto the code nitpicking!
= Extension vs. GUC =
In addition to not polluting the postgresql.conf.sample, there's another
reason this might make for better extension material eventually. Adding
new execv calls that are building strings like this is just generally a
risky thing. It would be nice from a security perspective if that
entire mechanism wasn't even introduced into the server at all unless
someone loaded the extension.
An extension implementation will end up being more code, both to add a
useful hook to replace these calls and for the extension packaging
itself. Having a bit more code in contrib/ but less in src &
postgresql.conf is probably a net positive trade though.
= Diff reduction / code refactoring =
Looks like you added a "File Operation Options" entry into guc.c but
then don't use it here? I would just keep this in an existing category
for now, and try to reduce the diff length of the proof of concept version
as much as possible in the beginning.
On the topic of smaller diffs, the similar cut and paste sections of the
two entries that both do fork/exec/waitpid should really be refactored
into one function. The archiver does something similar for running
archive_command, there may be some reuse or refactoring of its code to
present this interface.
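As a rough, untested sketch (all names invented here), the shared helper
might look like this, with each caller building its own argv:

#include <sys/wait.h>

static void
run_external_command(char *const argv[])
{
	pid_t		pid;
	int			status;

	if ((pid = fork_process()) == 0)
	{
		/* child: replace ourselves with the external command */
		execv(argv[0], argv);
		ereport(LOG, (errmsg("could not execute \"%s\"", argv[0])));
		exit(1);				/* cannot return from the child */
	}
	if (pid < 0)
		ereport(ERROR, (errmsg("could not fork: %m")));
	if (waitpid(pid, &status, 0) != pid || status != 0)
		ereport(ERROR, (errmsg("external command \"%s\" failed", argv[0])));
}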
Again, this sort of refactoring is not necessary as a POC patch. But it
will probably come up if this moves toward commit candidate.
= Recursion and directory navigation =
In either case, the directories are copied recursively while the
Postgres internal copydir function does not recurse. I don't think that
should be a problem since there shouldn't be nested directories in the
first place.
copydir takes an option for whether it should recurse or not.
The rm side of this makes me twitch for a number of reasons. First off,
there's just the general scariness of the concept of shelling out to run
rm recursively with some text string you build. The worst time I saw a
bug in that sort of code destroyed a terabyte, and the last time I saw
such a bug was only a week ago. Validation you're doing the right thing
is always job #1 in removing files.
Specifically here, take a look at src/port/dirmod.c for a) its comments
on race conditions and b) how it does error handling. The external rm
will need to duplicate all of that. I don't see how you could possibly
replace rmtree with something simplier that has the same necessary
properties. I think what you're actually going to need is a hook to
replace both the unlink and rmdir calls with your alternate
implementation idea.
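Roughly the kind of hook I mean (illustrative only), so an alternate
implementation can take over the two calls without touching the
surrounding race-condition and error handling:

typedef int (*remove_file_hook_type) (const char *path);
typedef int (*remove_dir_hook_type) (const char *path);

/* NULL means use the standard calls; an extension could override these */
remove_file_hook_type remove_file_hook = NULL;
remove_dir_hook_type remove_dir_hook = NULL;

static int
pg_remove_file(const char *path)
{
	return remove_file_hook ? remove_file_hook(path) : unlink(path);
}

static int
pg_remove_dir(const char *path)
{
	return remove_dir_hook ? remove_dir_hook(path) : rmdir(path);
}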
The same sort of issue is in your external_copydir. Iterating into
subdirectories when it doesn't happen now just isn't safe, even though
the one expected case you're hooking won't be any different. You really
can't just do that. Would this work instead, and is there any concern
about files that start with a "."?
"cp * --reflink=auto"
Regardless, you need to keep most of the structure to copydir anyway.
Error handling, handling cancellation, and fsync calls are all vital
things. You probably have to make the forked command copy a single file
at a time to get the same interrupt handling behavior.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Greg Smith wrote:
I think I can see how to construct such an example for the btrfs
version, but having you show that explicitly (preferably with a whole
sample session executing it) will also help reviewers. Remember: if
you want to get your submission off to a good start, the reviewer should
be able to run your sample test, see the code work, and do something fun
within a few seconds of compiling it. Make that easy for them, and your
reviewer will start with a good impression of you and a positive outlook
for the change.
Yes, an example is very simple with Btrfs, since it only requires one
GUC variable and that the cluster be created on a Btrfs file system.
Constructing a ZFS example is decidedly non-trivial and I'm starting to
question whether it's worth it.
Now onto the code nitpicking!
= Extension vs. GUC =
In addition to not polluting the postgresql.conf.sample, there's another
reason this might make for better extension material eventually. Adding
new execv calls that are building strings like this is just generally a
risky thing. It would be nice from a security perspective if that
entire mechanism wasn't even introduced into the server at all unless
someone loaded the extension.

An extension implementation will end up being more code, both to add a
useful hook to replace these calls and for the extension packaging
itself. Having a bit more code in contrib/ but less in src &
postgresql.conf is probably a net positive trade though.
I will look into how to write an extension.
= Diff reduction / code refactoring =
Looks like you added a "File Operation Options" entry into guc.c but
then don't use it here? I would just keep this in an existing category
for now, and try to reduce the diff length of the proof of concept version
as much as possible in the beginning.
Yes, that is a good idea.
On the topic of smaller diffs, the similar cut and paste sections of the
two entries that both do fork/exec/waitpid should really be refactored
into one function. The archiver does something similar for running
archive_command, there may be some reuse or refactoring of its code to
present this interface.
I'll look at archive_command to see what might be in common.
Again, this sort of refactoring is not necessary as a POC patch. But it
will probably come up if this moves toward commit candidate.

= Recursion and directory navigation =
In either case, the directories are copied recursively while the
Postgres internal copydir function does not recurse. I don't think that
should be a problem since there shouldn't be nested directories in the
first place.

copydir takes an option for whether it should recurse or not.
The rm side of this makes me twitch for a number of reasons. First off,
there's just the general scariness of the concept of shelling out to run
rm recursively with some text string you build. The worst time I saw a
bug in that sort of code destroyed a terabyte, and the last time I saw
such a bug was only a week ago. Validation you're doing the right thing
is always job #1 in removing files.
I needed to call an external command to remove a directory only when
experimenting on ZFS since the regular implementation works fine on
Btrfs. Unlike Btrfs, ZFS does not have any capability to clone
individual files. Therefore, a directory under base/ has to start out as
a ZFS snapshot, which can be cloned with the "zfs clone" command to
become a snapshot directory with exactly the same contents. To remove
the clone, the "zfs destroy" command has to be called on the clone.
AFAICT, clone and remove operations on ZFS always operate on a whole
directory at a time.
The same sort of issue is in your external_copydir. Iterating into
subdirectories when it doesn't happen now just isn't safe, even though
the one expected case you're hooking won't be any different. You really
can't just do that. Would this work instead, and is there any concern
about files that start with a "."?

"cp * --reflink=auto"
Regardless, you need to keep most of the structure to copydir anyway.
Error handling, handling cancellation, and fsync calls are all vital
things. You probably have to make the forked command copy a single file
at a time to get the same interrupt handling behavior.
In an earlier implementation, I did call "cp --reflink=auto" once per
regular file, preserving the behavior of copydir. On Btrfs, this works
well, though slightly slower due to extra processes. AFAIK, there's no
way to do something equivalent on ZFS without coming up with a much more
complicated scheme involving both links and clones.
I don't think it will be possible to implement a scheme that works on
ZFS and addresses your concerns about file and directory handling that
is not many times more complex than what I have so far. OTOH, I think
the approach I have already implemented which calls an external command
for each regular file to copy might be acceptable. Since I don't
personally have much interest in using ZFS, I think I'm going to just
focus on Btrfs for now.
--
Jonathan Ross Rogers
On 3/9/13 11:39 PM, Jonathan Rogers wrote:
In an earlier implementation, I did call "cp --reflink=auto" once per
regular file, preserving the behavior of copydir. On Btrfs, this works
well, though slightly slower due to extra processes. AFAIK, there's no
way to do something equivalent on ZFS without coming up with a much more
complicated scheme involving both links and clones.
Chasing after the performance win of the copy acceleration seems worth
it. We can't really do that if it results in an obvious behavior change
though. If it takes a bit longer to do it a file at a time but that's
the right thing to do, then as long as it's still a net win this idea is
still worthwhile.
I don't think it will be possible to implement a scheme that works on
ZFS and addresses your concerns about file and directory handling that
is not many times more complex than what I have so far. OTOH, I think
the approach I have already implemented which calls an external command
for each regular file to copy might be acceptable. Since I don't
personally have much interest in using ZFS, I think I'm going to just
focus on Btrfs for now.
Just be aware that OS specific features like this are a lot more likely
to be accepted if they're demonstrated to work on two OSes. That's much
better evidence that the API for the OS specific work is a good one.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com