pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

Started by Michael Paquierover 9 years ago28 messages
#1Michael Paquier
michael.paquier@gmail.com
2 attachment(s)

Hi all,

Beginning a new thread because the ext4 issues are closed, and because
pg_basebackup data durability meritates a new thread. And in short
about the problem: pg_basebackup makes no effort in being sure that
the data it backs up is on disk, which is bad... One possible
recommendation is to use initdb -S after running pg_basebackup, but
making sure that data is on disk should be done before pg_basebackup
ends.

On Thu, May 12, 2016 at 8:09 PM, I wrote:

And actually this won't fly high if there is no equivalent of
walkdir() or if the fsync()'s are not applied recursively. On master
at least the refactoring had better be done cleanly first... For the
back branches, we could just have some recursive call like
fsync_recursively and keep that in src/bin/pg_basebackup. Andres, do
you think that this should be part of fe_utils or src/common/? I'd
tend to think the latter is more adapted as there is an equivalent in
the backend. On back-branches, we could just have something like
fsync_recursively that walks though the paths. An even more simple
approach would be to fsync() individually things that have been
written, but that would suck in performance.

So, attached are two patches that apply on HEAD to address the problem
of pg_basebackup that does not sync the data it writes. As
pg_basebackup cannot use directly initdb -S because, as a client-side
utility, it may be installed while initdb is not (see Fedora and
RHEL), I have refactored the code so as the routines in initdb.c doing
the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
this is 0001.

Patch 0002 is a set of fixes for pg_basebackup:
- In plain mode, fsync_pgdata is used so as all the tablespaces are
fsync'd at once. This takes care as well of the case where pg_xlog is
a symlink.
- In tar mode (no stdout), each tar file is synced individually, and
the base directory is synced once at the end.
In both cases, failures are not considered fatal.

With pg_basebackup -X and pg_receivexlog, the manipulation of WAL
files is made durable by using fsync and durable_rename where needed
(credits to Andres mainly for this part).

This set of patches is aimed only at HEAD. Back-patchable versions of
this patch would need to copy fsync_pgdata and friends into
streamutil.c for example.

I am adding that to the next CF for review as a bug fix.
Regards,
--
Michael

Attachments:

0001-Relocation-fsync-routines-of-initdb-into-fe_utils.patchapplication/x-download; name=0001-Relocation-fsync-routines-of-initdb-into-fe_utils.patchDownload
From a5301655dd927e2c37c0b6156c24568bdba6eac3 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 13 May 2016 14:27:04 +0900
Subject: [PATCH 1/2] Relocation fsync routines of initdb into fe_utils

Those are aimed at being used by other utilities, pg_basebackup mainly,
and at some other degree pg_dump and pg_receivexlog.
---
 src/bin/initdb/Makefile           |   3 +-
 src/bin/initdb/initdb.c           | 266 +-----------------------------------
 src/fe_utils/Makefile             |   2 +-
 src/fe_utils/file_utils.c         | 279 ++++++++++++++++++++++++++++++++++++++
 src/include/fe_utils/file_utils.h |  22 +++
 src/tools/msvc/Mkvcbuild.pm       |   3 +-
 6 files changed, 313 insertions(+), 262 deletions(-)
 create mode 100644 src/fe_utils/file_utils.c
 create mode 100644 src/include/fe_utils/file_utils.h

diff --git a/src/bin/initdb/Makefile b/src/bin/initdb/Makefile
index 094c894..e6b3f72 100644
--- a/src/bin/initdb/Makefile
+++ b/src/bin/initdb/Makefile
@@ -17,6 +17,7 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 override CPPFLAGS := -DFRONTEND -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(CPPFLAGS)
+LDFLAGS += -L$(top_builddir)/src/fe_utils -lpgfeutils
 
 # use system timezone data?
 ifneq (,$(with_system_tzdata))
@@ -27,7 +28,7 @@ OBJS=	initdb.o findtimezone.o localtime.o encnames.o $(WIN32RES)
 
 all: initdb
 
-initdb: $(OBJS) | submake-libpgport
+initdb: $(OBJS) | submake-libpgport submake-libpgfeutils
 	$(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
 
 # We used to pull in all of libpq to get encnames.c, but that
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ec8c38e..9519902 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -63,19 +63,13 @@
 #include "catalog/catalog.h"
 #include "common/restricted_token.h"
 #include "common/username.h"
+#include "fe_utils/file_utils.h"
 #include "mb/pg_wchar.h"
 #include "getaddrinfo.h"
 #include "getopt_long.h"
 #include "miscadmin.h"
 
 
-/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
-#if defined(HAVE_SYNC_FILE_RANGE)
-#define PG_FLUSH_DATA_WORKS 1
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-#define PG_FLUSH_DATA_WORKS 1
-#endif
-
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
 
@@ -235,13 +229,6 @@ static char **filter_lines_with_token(char **lines, const char *token);
 #endif
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
-static void walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks);
-#ifdef PG_FLUSH_DATA_WORKS
-static void pre_sync_fname(const char *fname, bool isdir);
-#endif
-static void fsync_fname_ext(const char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
@@ -268,7 +255,6 @@ static void load_plpgsql(FILE *cmdfd);
 static void vacuum_db(FILE *cmdfd);
 static void make_template0(FILE *cmdfd);
 static void make_postgres(FILE *cmdfd);
-static void fsync_pgdata(void);
 static void trapsig(int signum);
 static void check_ok(void);
 static char *escape_quotes(const char *src);
@@ -535,177 +521,6 @@ writefile(char *path, char **lines)
 }
 
 /*
- * walkdir: recursively walk a directory, applying the action to each
- * regular file and directory (including the named directory itself).
- *
- * If process_symlinks is true, the action and recursion are also applied
- * to regular files and directories that are pointed to by symlinks in the
- * given directory; otherwise symlinks are ignored.  Symlinks are always
- * ignored in subdirectories, ie we intentionally don't pass down the
- * process_symlinks flag to recursive calls.
- *
- * Errors are reported but not considered fatal.
- *
- * See also walkdir in fd.c, which is a backend version of this logic.
- */
-static void
-walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
-{
-	DIR		   *dir;
-	struct dirent *de;
-
-	dir = opendir(path);
-	if (dir == NULL)
-	{
-		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
-				progname, path, strerror(errno));
-		return;
-	}
-
-	while (errno = 0, (de = readdir(dir)) != NULL)
-	{
-		char		subpath[MAXPGPATH];
-		struct stat fst;
-		int			sret;
-
-		if (strcmp(de->d_name, ".") == 0 ||
-			strcmp(de->d_name, "..") == 0)
-			continue;
-
-		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
-
-		if (process_symlinks)
-			sret = stat(subpath, &fst);
-		else
-			sret = lstat(subpath, &fst);
-
-		if (sret < 0)
-		{
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-					progname, subpath, strerror(errno));
-			continue;
-		}
-
-		if (S_ISREG(fst.st_mode))
-			(*action) (subpath, false);
-		else if (S_ISDIR(fst.st_mode))
-			walkdir(subpath, action, false);
-	}
-
-	if (errno)
-		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
-				progname, path, strerror(errno));
-
-	(void) closedir(dir);
-
-	/*
-	 * It's important to fsync the destination directory itself as individual
-	 * file fsyncs don't guarantee that the directory entry for the file is
-	 * synced.  Recent versions of ext4 have made the window much wider but
-	 * it's been an issue for ext3 and other filesystems in the past.
-	 */
-	(*action) (path, true);
-}
-
-/*
- * Hint to the OS that it should get ready to fsync() this file.
- *
- * Ignores errors trying to open unreadable files, and reports other errors
- * non-fatally.
- */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static void
-pre_sync_fname(const char *fname, bool isdir)
-{
-	int			fd;
-
-	fd = open(fname, O_RDONLY | PG_BINARY);
-
-	if (fd < 0)
-	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
-		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-		return;
-	}
-
-	/*
-	 * We do what pg_flush_data() would do in the backend: prefer to use
-	 * sync_file_range, but fall back to posix_fadvise.  We ignore errors
-	 * because this is only a hint.
-	 */
-#if defined(HAVE_SYNC_FILE_RANGE)
-	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
-#else
-#error PG_FLUSH_DATA_WORKS should not have been defined
-#endif
-
-	(void) close(fd);
-}
-
-#endif   /* PG_FLUSH_DATA_WORKS */
-
-/*
- * fsync_fname_ext -- Try to fsync a file or directory
- *
- * Ignores errors trying to open unreadable files, or trying to fsync
- * directories on systems where that isn't allowed/required.  Reports
- * other errors non-fatally.
- */
-static void
-fsync_fname_ext(const char *fname, bool isdir)
-{
-	int			fd;
-	int			flags;
-	int			returncode;
-
-	/*
-	 * Some OSs require directories to be opened read-only whereas other
-	 * systems don't allow us to fsync files opened read-only; so we need both
-	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
-	 * not writable by our userid, but we assume that's OK.
-	 */
-	flags = PG_BINARY;
-	if (!isdir)
-		flags |= O_RDWR;
-	else
-		flags |= O_RDONLY;
-
-	/*
-	 * Open the file, silently ignoring errors about unreadable files (or
-	 * unsupported operations, e.g. opening a directory under Windows), and
-	 * logging others.
-	 */
-	fd = open(fname, flags);
-	if (fd < 0)
-	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
-		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-		return;
-	}
-
-	returncode = fsync(fd);
-
-	/*
-	 * Some OSes don't allow us to fsync directories at all, so we can ignore
-	 * those errors. Anything else needs to be reported.
-	 */
-	if (returncode != 0 && !(isdir && errno == EBADF))
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-
-	(void) close(fd);
-}
-
-/*
  * Open a subcommand with suitable error messaging
  */
 static FILE *
@@ -2297,77 +2112,6 @@ make_postgres(FILE *cmdfd)
 		PG_CMD_PUTS(*line);
 }
 
-/*
- * Issue fsync recursively on PGDATA and all its contents.
- *
- * We fsync regular files and directories wherever they are, but we
- * follow symlinks only for pg_xlog and immediately under pg_tblspc.
- * Other symlinks are presumed to point at files we're not responsible
- * for fsyncing, and might not have privileges to write at all.
- *
- * Errors are reported but not considered fatal.
- */
-static void
-fsync_pgdata(void)
-{
-	bool		xlog_is_symlink;
-	char		pg_xlog[MAXPGPATH];
-	char		pg_tblspc[MAXPGPATH];
-
-	fputs(_("syncing data to disk ... "), stdout);
-	fflush(stdout);
-
-	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
-	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
-
-	/*
-	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
-	 * because the first walkdir below will ignore it.
-	 */
-	xlog_is_symlink = false;
-
-#ifndef WIN32
-	{
-		struct stat st;
-
-		if (lstat(pg_xlog, &st) < 0)
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-					progname, pg_xlog, strerror(errno));
-		else if (S_ISLNK(st.st_mode))
-			xlog_is_symlink = true;
-	}
-#else
-	if (pgwin32_is_junction(pg_xlog))
-		xlog_is_symlink = true;
-#endif
-
-	/*
-	 * If possible, hint to the kernel that we're soon going to fsync the data
-	 * directory and its contents.
-	 */
-#ifdef PG_FLUSH_DATA_WORKS
-	walkdir(pg_data, pre_sync_fname, false);
-	if (xlog_is_symlink)
-		walkdir(pg_xlog, pre_sync_fname, false);
-	walkdir(pg_tblspc, pre_sync_fname, true);
-#endif
-
-	/*
-	 * Now we do the fsync()s in the same order.
-	 *
-	 * The main call ignores symlinks, so in addition to specially processing
-	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
-	 * process_symlinks = true.  Note that if there are any plain directories
-	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
-	 * so we don't worry about optimizing it.
-	 */
-	walkdir(pg_data, fsync_fname_ext, false);
-	if (xlog_is_symlink)
-		walkdir(pg_xlog, fsync_fname_ext, false);
-	walkdir(pg_tblspc, fsync_fname_ext, true);
-
-	check_ok();
-}
 
 
 /*
@@ -3534,7 +3278,8 @@ main(int argc, char *argv[])
 			exit_nicely();
 		}
 
-		fsync_pgdata();
+		fsync_pgdata(pg_data, progname);
+		check_ok();
 		return 0;
 	}
 
@@ -3593,7 +3338,10 @@ main(int argc, char *argv[])
 	initialize_data_directory();
 
 	if (do_sync)
-		fsync_pgdata();
+	{
+		fsync_pgdata(pg_data, progname);
+		check_ok();
+	}
 	else
 		printf(_("\nSync to disk skipped.\nThe data directory might become corrupt if the operating system crashes.\n"));
 
diff --git a/src/fe_utils/Makefile b/src/fe_utils/Makefile
index 9da03b1..51a7804 100644
--- a/src/fe_utils/Makefile
+++ b/src/fe_utils/Makefile
@@ -19,7 +19,7 @@ include $(top_builddir)/src/Makefile.global
 
 override CPPFLAGS := -DFRONTEND -I$(libpq_srcdir) $(CPPFLAGS)
 
-OBJS = mbprint.o print.o psqlscan.o simple_list.o string_utils.o
+OBJS = file_utils.o mbprint.o print.o psqlscan.o simple_list.o string_utils.o
 
 all: libpgfeutils.a
 
diff --git a/src/fe_utils/file_utils.c b/src/fe_utils/file_utils.c
new file mode 100644
index 0000000..80f6dbd
--- /dev/null
+++ b/src/fe_utils/file_utils.c
@@ -0,0 +1,279 @@
+/*-------------------------------------------------------------------------
+ *
+ * File-processing utility routines for frontend code
+ *
+ * Assorted utility functions to work on files.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/fe_utils/file_utils.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "fe_utils/file_utils.h"
+
+
+/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
+#if defined(HAVE_SYNC_FILE_RANGE)
+#define PG_FLUSH_DATA_WORKS 1
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+#define PG_FLUSH_DATA_WORKS 1
+#endif
+
+#ifdef PG_FLUSH_DATA_WORKS
+static void pre_sync_fname(const char *fname, bool isdir,
+						   const char *progname);
+#endif
+static void walkdir(const char *path,
+	void (*action) (const char *fname, bool isdir, const char *progname),
+	bool process_symlinks, const char *progname);
+
+/*
+ * Issue fsync recursively on PGDATA and all its contents.
+ *
+ * We fsync regular files and directories wherever they are, but we
+ * follow symlinks only for pg_xlog and immediately under pg_tblspc.
+ * Other symlinks are presumed to point at files we're not responsible
+ * for fsyncing, and might not have privileges to write at all.
+ *
+ * Errors are reported but not considered fatal.
+ */
+void
+fsync_pgdata(const char *pg_data, const char *progname)
+{
+	bool		xlog_is_symlink;
+	char		pg_xlog[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
+
+	fputs(_("syncing data to disk ... "), stdout);
+	fflush(stdout);
+
+	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
+	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
+
+	/*
+	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
+	 * because the first walkdir below will ignore it.
+	 */
+	xlog_is_symlink = false;
+
+#ifndef WIN32
+	{
+		struct stat st;
+
+		if (lstat(pg_xlog, &st) < 0)
+			fprintf(stderr, _("could not stat file \"%s\": %s\n"),
+					pg_xlog, strerror(errno));
+		else if (S_ISLNK(st.st_mode))
+			xlog_is_symlink = true;
+	}
+#else
+	if (pgwin32_is_junction(pg_xlog))
+		xlog_is_symlink = true;
+#endif
+
+	/*
+	 * If possible, hint to the kernel that we're soon going to fsync the data
+	 * directory and its contents.
+	 */
+#ifdef PG_FLUSH_DATA_WORKS
+	walkdir(pg_data, pre_sync_fname, false, progname);
+	if (xlog_is_symlink)
+		walkdir(pg_xlog, pre_sync_fname, false, progname);
+	walkdir(pg_tblspc, pre_sync_fname, true, progname);
+#endif
+
+	/*
+	 * Now we do the fsync()s in the same order.
+	 *
+	 * The main call ignores symlinks, so in addition to specially processing
+	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
+	 * process_symlinks = true.  Note that if there are any plain directories
+	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
+	 * so we don't worry about optimizing it.
+	 */
+	walkdir(pg_data, fsync_fname_ext, false, progname);
+	if (xlog_is_symlink)
+		walkdir(pg_xlog, fsync_fname_ext, false, progname);
+	walkdir(pg_tblspc, fsync_fname_ext, true, progname);
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself).
+ *
+ * If process_symlinks is true, the action and recursion are also applied
+ * to regular files and directories that are pointed to by symlinks in the
+ * given directory; otherwise symlinks are ignored.  Symlinks are always
+ * ignored in subdirectories, ie we intentionally don't pass down the
+ * process_symlinks flag to recursive calls.
+ *
+ * Errors are reported but not considered fatal.
+ *
+ * See also walkdir in fd.c, which is a backend version of this logic.
+ */
+static void
+walkdir(const char *path,
+		void (*action) (const char *fname, bool isdir, const char *progname),
+		bool process_symlinks, const char *progname)
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = opendir(path);
+	if (dir == NULL)
+	{
+		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
+				progname, path, strerror(errno));
+		return;
+	}
+
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+		int			sret;
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (process_symlinks)
+			sret = stat(subpath, &fst);
+		else
+			sret = lstat(subpath, &fst);
+
+		if (sret < 0)
+		{
+			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
+					progname, subpath, strerror(errno));
+			continue;
+		}
+
+		if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false, progname);
+		else if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action, false, progname);
+	}
+
+	if (errno)
+		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
+				progname, path, strerror(errno));
+
+	(void) closedir(dir);
+
+	/*
+	 * It's important to fsync the destination directory itself as individual
+	 * file fsyncs don't guarantee that the directory entry for the file is
+	 * synced.  Recent versions of ext4 have made the window much wider but
+	 * it's been an issue for ext3 and other filesystems in the past.
+	 */
+	(*action) (path, true, progname);
+}
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Ignores errors trying to open unreadable files, and reports other errors
+ * non-fatally.
+ */
+#ifdef PG_FLUSH_DATA_WORKS
+
+static void
+pre_sync_fname(const char *fname, bool isdir, const char *progname)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return;
+	}
+
+	/*
+	 * We do what pg_flush_data() would do in the backend: prefer to use
+	 * sync_file_range, but fall back to posix_fadvise.  We ignore errors
+	 * because this is only a hint.
+	 */
+#if defined(HAVE_SYNC_FILE_RANGE)
+	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+#else
+#error PG_FLUSH_DATA_WORKS should not have been defined
+#endif
+
+	(void) close(fd);
+}
+
+#endif   /* PG_FLUSH_DATA_WORKS */
+
+/*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Ignores errors trying to open unreadable files, or trying to fsync
+ * directories on systems where that isn't allowed/required.  Reports
+ * other errors non-fatally.
+ */
+void
+fsync_fname_ext(const char *fname, bool isdir, const char *progname)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+
+	(void) close(fd);
+}
diff --git a/src/include/fe_utils/file_utils.h b/src/include/fe_utils/file_utils.h
new file mode 100644
index 0000000..189cc6d
--- /dev/null
+++ b/src/include/fe_utils/file_utils.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * File-processing utility routines for frontend code
+ *
+ * Assorted utility functions to work on files.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/fe_utils/file_utils.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FILE_UTILS_H
+#define FILE_UTILS_H
+
+extern void fsync_fname_ext(const char *fname, bool isdir,
+							const char *progname);
+extern void fsync_pgdata(const char *pg_data, const char *progname);
+
+#endif   /* FILE_UTILS_H */
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index e7268cb..e09c29d 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -118,7 +118,8 @@ sub mkvcbuild
 	our @pgcommonbkndfiles = @pgcommonallfiles;
 
 	our @pgfeutilsfiles = qw(
-	  mbprint.c print.c psqlscan.l psqlscan.c simple_list.c string_utils.c);
+	  file_utils.c mbprint.c print.c psqlscan.l psqlscan.c simple_list.c
+	  string_utils.c);
 
 	$libpgport = $solution->AddProject('libpgport', 'lib', 'misc');
 	$libpgport->AddDefine('FRONTEND');
-- 
2.8.2

0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patchapplication/x-download; name=0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patchDownload
From cdb73b570ba125a64fdc935994abecd7d96e5e9a Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 13 May 2016 15:23:50 +0900
Subject: [PATCH 2/2] Issue fsync more carefully in pg_receivexlog and
 pg_basebackup [-X] stream.

Several places weren't careful about fsyncing in the way. See 1d4a0ab1
and 606e0f98 for details about required fsyns.

This adds a couple of routines in fe_utils which have an equivalent in the
backend:
- durable_rename
- fsync_parent_path
---
 src/bin/pg_basebackup/Makefile        |  1 +
 src/bin/pg_basebackup/pg_basebackup.c | 27 ++++++++++++
 src/bin/pg_basebackup/receivelog.c    | 56 ++++++++++++++----------
 src/fe_utils/file_utils.c             | 82 ++++++++++++++++++++++++++++++++---
 src/include/fe_utils/file_utils.h     |  5 ++-
 5 files changed, 142 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 5854672..5b9bd39 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -17,6 +17,7 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS += -L$(top_builddir)/src/fe_utils -lpgfeutils
 
 OBJS=receivelog.o streamutil.o $(WIN32RES)
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 2927b60..0cda7c8 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -26,6 +26,7 @@
 #endif
 
 #include "common/string.h"
+#include "fe_utils/file_utils.h"
 #include "getopt_long.h"
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
@@ -1114,6 +1115,10 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (copybuf != NULL)
 		PQfreemem(copybuf);
+
+	/* sync the resulting tar file, errors are not considered fatal */
+	if (strcmp(basedir, "-") != 0)
+		(void) fsync_fname_ext(filename, false, progname);
 }
 
 
@@ -1390,6 +1395,11 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (basetablespace && writerecoveryconf)
 		WriteRecoveryConf();
+
+	/*
+	 * No data is synced here, everything is done for all tablespaces at the
+	 * end.
+	 */
 }
 
 /*
@@ -1931,6 +1941,23 @@ BaseBackup(void)
 	PQclear(res);
 	PQfinish(conn);
 
+	/*
+	 * Make data persistent on disk once backup is completed. For tar
+	 * format once syncing the parent directory is fine, each tar file
+	 * created per tablespace has been already synced. In plain format,
+	 * all the data of the base directory is synced, taking into account
+	 * all the tablespaces. Errors are not considered fatal.
+	 */
+	if (format == 't')
+	{
+		if (strcmp(basedir, "-") != 0)
+			(void) fsync_fname_ext(basedir, true, progname);
+	}
+	else
+	{
+		(void) fsync_pgdata(basedir, progname);
+	}
+
 	if (verbose)
 		fprintf(stderr, "%s: base backup completed\n", progname);
 }
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 595213f..f528cd4 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -23,6 +23,7 @@
 
 #include "libpq-fe.h"
 #include "access/xlog_internal.h"
+#include "fe_utils/file_utils.h"
 
 
 /* fd and filename for currently open WAL file */
@@ -68,17 +69,13 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
+	close(fd);
 
+	if (fsync_fname_ext(tmppath, false, progname) != 0)
 		return false;
-	}
 
-	close(fd);
+	if (fsync_parent_path(tmppath, progname) != 0)
+		return false;
 
 	return true;
 }
@@ -116,6 +113,10 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	/*
 	 * Verify that the file is either empty (just created), or a complete
 	 * XLogSegSize segment. Anything in between indicates a corrupt file.
+	 *
+	 * XXX: This means that we might not restart if a crash occurs before the
+	 * fsync below. We probably should create the file in a temporary path
+	 * like the backend does...
 	 */
 	if (fstat(f, &statbuf) != 0)
 	{
@@ -129,6 +130,16 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	{
 		/* File is open and ready to use */
 		walfile = f;
+
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname_ext(fn, false, progname) != 0)
+			return false;
+		if (fsync_parent_path(fn, progname) != 0)
+			return false;
+
 		return true;
 	}
 	if (statbuf.st_size != 0)
@@ -157,6 +168,17 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed. That's particularly important when
+	 * using synchronous mode, where the file is modified and fsynced
+	 * in-place, without a directory fsync.
+	 */
+	if (fsync_fname_ext(fn, false, progname) != 0)
+		return false;
+	if (fsync_parent_path(fn, progname) != 0)
+		return false;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -217,10 +239,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn, progname) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -338,14 +359,6 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		close(fd);
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-		return false;
-	}
-
 	if (close(fd) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
@@ -356,10 +369,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path, progname) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
diff --git a/src/fe_utils/file_utils.c b/src/fe_utils/file_utils.c
index 80f6dbd..4fffdf1 100644
--- a/src/fe_utils/file_utils.c
+++ b/src/fe_utils/file_utils.c
@@ -34,7 +34,7 @@ static void pre_sync_fname(const char *fname, bool isdir,
 						   const char *progname);
 #endif
 static void walkdir(const char *path,
-	void (*action) (const char *fname, bool isdir, const char *progname),
+	int (*action) (const char *fname, bool isdir, const char *progname),
 	bool process_symlinks, const char *progname);
 
 /*
@@ -54,7 +54,7 @@ fsync_pgdata(const char *pg_data, const char *progname)
 	char		pg_xlog[MAXPGPATH];
 	char		pg_tblspc[MAXPGPATH];
 
-	fputs(_("syncing data to disk ... "), stdout);
+	fputs(_("syncing data to disk ...\n"), stdout);
 	fflush(stdout);
 
 	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
@@ -123,7 +123,7 @@ fsync_pgdata(const char *pg_data, const char *progname)
  */
 static void
 walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir, const char *progname),
+		int (*action) (const char *fname, bool isdir, const char *progname),
 		bool process_symlinks, const char *progname)
 {
 	DIR		   *dir;
@@ -231,7 +231,7 @@ pre_sync_fname(const char *fname, bool isdir, const char *progname)
  * directories on systems where that isn't allowed/required.  Reports
  * other errors non-fatally.
  */
-void
+int
 fsync_fname_ext(const char *fname, bool isdir, const char *progname)
 {
 	int			fd;
@@ -259,10 +259,10 @@ fsync_fname_ext(const char *fname, bool isdir, const char *progname)
 	if (fd < 0)
 	{
 		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
+			return 0;
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 				progname, fname, strerror(errno));
-		return;
+		return -1;
 	}
 
 	returncode = fsync(fd);
@@ -272,8 +272,78 @@ fsync_fname_ext(const char *fname, bool isdir, const char *progname)
 	 * those errors. Anything else needs to be reported.
 	 */
 	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
 		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
 				progname, fname, strerror(errno));
+		return -1;
+	}
 
 	(void) close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+int
+fsync_parent_path(const char *fname, const char *progname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname_ext(parentpath, true, progname) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.  Note that this
+ * version does not fsync the target file before the rename, as it's unlikely
+ * to be helpful for current and prospective users.
+ */
+int
+durable_rename(const char *oldfile, const char *newfile, const char *progname)
+{
+	/*
+	 * First fsync the old path, to ensure that it is properly persistent on
+	 * disk.
+	 */
+	if (fsync_fname_ext(oldfile, false, progname) != 0)
+		return -1;
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false, progname) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile, progname) != 0)
+		return -1;
+
+	return 0;
 }
diff --git a/src/include/fe_utils/file_utils.h b/src/include/fe_utils/file_utils.h
index 189cc6d..698488a 100644
--- a/src/include/fe_utils/file_utils.h
+++ b/src/include/fe_utils/file_utils.h
@@ -15,8 +15,11 @@
 #ifndef FILE_UTILS_H
 #define FILE_UTILS_H
 
-extern void fsync_fname_ext(const char *fname, bool isdir,
+extern int fsync_fname_ext(const char *fname, bool isdir,
 							const char *progname);
 extern void fsync_pgdata(const char *pg_data, const char *progname);
+extern int durable_rename(const char *oldfile, const char *newfile,
+						  const char *progname);
+extern int fsync_parent_path(const char *fname, const char *progname);
 
 #endif   /* FILE_UTILS_H */
-- 
2.8.2

#2Alex Ignatov
a.ignatov@postgrespro.ru
In reply to: Michael Paquier (#1)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On 13.05.2016 9:39, Michael Paquier wrote:

Hi all,

Beginning a new thread because the ext4 issues are closed, and because
pg_basebackup data durability meritates a new thread. And in short
about the problem: pg_basebackup makes no effort in being sure that
the data it backs up is on disk, which is bad... One possible
recommendation is to use initdb -S after running pg_basebackup, but
making sure that data is on disk should be done before pg_basebackup
ends.

On Thu, May 12, 2016 at 8:09 PM, I wrote:

And actually this won't fly high if there is no equivalent of
walkdir() or if the fsync()'s are not applied recursively. On master
at least the refactoring had better be done cleanly first... For the
back branches, we could just have some recursive call like
fsync_recursively and keep that in src/bin/pg_basebackup. Andres, do
you think that this should be part of fe_utils or src/common/? I'd
tend to think the latter is more adapted as there is an equivalent in
the backend. On back-branches, we could just have something like
fsync_recursively that walks though the paths. An even more simple
approach would be to fsync() individually things that have been
written, but that would suck in performance.

So, attached are two patches that apply on HEAD to address the problem
of pg_basebackup that does not sync the data it writes. As
pg_basebackup cannot use directly initdb -S because, as a client-side
utility, it may be installed while initdb is not (see Fedora and
RHEL), I have refactored the code so as the routines in initdb.c doing
the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
this is 0001.

Patch 0002 is a set of fixes for pg_basebackup:
- In plain mode, fsync_pgdata is used so as all the tablespaces are
fsync'd at once. This takes care as well of the case where pg_xlog is
a symlink.
- In tar mode (no stdout), each tar file is synced individually, and
the base directory is synced once at the end.
In both cases, failures are not considered fatal.

With pg_basebackup -X and pg_receivexlog, the manipulation of WAL
files is made durable by using fsync and durable_rename where needed
(credits to Andres mainly for this part).

This set of patches is aimed only at HEAD. Back-patchable versions of
this patch would need to copy fsync_pgdata and friends into
streamutil.c for example.

I am adding that to the next CF for review as a bug fix.
Regards,

Hi!
Do we have any confidence that data file is not being corrupted? I.e
contains some corrupted page? Can pg_basebackup check page checksum (db
init with initdb -k) while backing up files?

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Michael Paquier
michael.paquier@gmail.com
In reply to: Alex Ignatov (#2)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Fri, May 13, 2016 at 11:49 PM, Alex Ignatov <a.ignatov@postgrespro.ru> wrote:

On 13.05.2016 9:39, Michael Paquier wrote:
Do we have any confidence that data file is not being corrupted? I.e
contains some corrupted page? Can pg_basebackup check page checksum (db init
with initdb -k) while backing up files?

That may be an idea to consider, for one class of users, though this
patch is not aimed at doing something regarding that.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#1)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On 5/13/16 2:39 AM, Michael Paquier wrote:

So, attached are two patches that apply on HEAD to address the problem
of pg_basebackup that does not sync the data it writes. As
pg_basebackup cannot use directly initdb -S because, as a client-side
utility, it may be installed while initdb is not (see Fedora and
RHEL), I have refactored the code so as the routines in initdb.c doing
the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
this is 0001.

Why fe_utils? initdb is not a front-end program.

Patch 0002 is a set of fixes for pg_basebackup:
- In plain mode, fsync_pgdata is used so as all the tablespaces are
fsync'd at once. This takes care as well of the case where pg_xlog is
a symlink.
- In tar mode (no stdout), each tar file is synced individually, and
the base directory is synced once at the end.
In both cases, failures are not considered fatal.

Maybe there should be --nosync options like initdb has?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Eisentraut (#4)
2 attachment(s)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 5/13/16 2:39 AM, Michael Paquier wrote:

So, attached are two patches that apply on HEAD to address the problem
of pg_basebackup that does not sync the data it writes. As
pg_basebackup cannot use directly initdb -S because, as a client-side
utility, it may be installed while initdb is not (see Fedora and
RHEL), I have refactored the code so as the routines in initdb.c doing
the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
this is 0001.

Why fe_utils? initdb is not a front-end program.

Thinking about that, you are right. Let's move it to src/common,
frontend-only though.

Patch 0002 is a set of fixes for pg_basebackup:
- In plain mode, fsync_pgdata is used so as all the tablespaces are
fsync'd at once. This takes care as well of the case where pg_xlog is
a symlink.
- In tar mode (no stdout), each tar file is synced individually, and
the base directory is synced once at the end.
In both cases, failures are not considered fatal.

Maybe there should be --nosync options like initdb has?

What do others think about that? I could implement that on top of 0002
with some extra options. But to be honest that looks to be just some
extra sugar for what is basically a bug fix... And I am feeling that
providing such a switch to users would be a way for one to shoot
himself badly, particularly for pg_receivexlog where a crash can cause
segments to go missing.
--
Michael

Attachments:

0001-Relocation-fsync-routines-of-initdb-into-src-common.patchapplication/x-download; name=0001-Relocation-fsync-routines-of-initdb-into-src-common.patchDownload
From 4c5623f134948443147b1a5cea4d6c7da6454378 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 2 Sep 2016 15:19:11 +0900
Subject: [PATCH 1/2] Relocation fsync routines of initdb into src/common

Those are aimed at being used by other utilities, pg_basebackup mainly,
and at some other degree by pg_dump and pg_receivexlog.
---
 src/bin/initdb/initdb.c         | 266 +-------------------------------------
 src/common/Makefile             |   2 +-
 src/common/file_utils.c         | 279 ++++++++++++++++++++++++++++++++++++++++
 src/include/common/file_utils.h |  22 ++++
 src/tools/msvc/Mkvcbuild.pm     |   2 +-
 5 files changed, 310 insertions(+), 261 deletions(-)
 create mode 100644 src/common/file_utils.c
 create mode 100644 src/include/common/file_utils.h

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9407492..0b74ff9 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -61,6 +61,7 @@
 #endif
 
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "common/restricted_token.h"
 #include "common/username.h"
 #include "mb/pg_wchar.h"
@@ -70,13 +71,6 @@
 #include "fe_utils/string_utils.h"
 
 
-/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
-#if defined(HAVE_SYNC_FILE_RANGE)
-#define PG_FLUSH_DATA_WORKS 1
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-#define PG_FLUSH_DATA_WORKS 1
-#endif
-
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
 
@@ -237,13 +231,6 @@ static char **filter_lines_with_token(char **lines, const char *token);
 #endif
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
-static void walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks);
-#ifdef PG_FLUSH_DATA_WORKS
-static void pre_sync_fname(const char *fname, bool isdir);
-#endif
-static void fsync_fname_ext(const char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
@@ -270,7 +257,6 @@ static void load_plpgsql(FILE *cmdfd);
 static void vacuum_db(FILE *cmdfd);
 static void make_template0(FILE *cmdfd);
 static void make_postgres(FILE *cmdfd);
-static void fsync_pgdata(void);
 static void trapsig(int signum);
 static void check_ok(void);
 static char *escape_quotes(const char *src);
@@ -529,177 +515,6 @@ writefile(char *path, char **lines)
 }
 
 /*
- * walkdir: recursively walk a directory, applying the action to each
- * regular file and directory (including the named directory itself).
- *
- * If process_symlinks is true, the action and recursion are also applied
- * to regular files and directories that are pointed to by symlinks in the
- * given directory; otherwise symlinks are ignored.  Symlinks are always
- * ignored in subdirectories, ie we intentionally don't pass down the
- * process_symlinks flag to recursive calls.
- *
- * Errors are reported but not considered fatal.
- *
- * See also walkdir in fd.c, which is a backend version of this logic.
- */
-static void
-walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
-{
-	DIR		   *dir;
-	struct dirent *de;
-
-	dir = opendir(path);
-	if (dir == NULL)
-	{
-		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
-				progname, path, strerror(errno));
-		return;
-	}
-
-	while (errno = 0, (de = readdir(dir)) != NULL)
-	{
-		char		subpath[MAXPGPATH];
-		struct stat fst;
-		int			sret;
-
-		if (strcmp(de->d_name, ".") == 0 ||
-			strcmp(de->d_name, "..") == 0)
-			continue;
-
-		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
-
-		if (process_symlinks)
-			sret = stat(subpath, &fst);
-		else
-			sret = lstat(subpath, &fst);
-
-		if (sret < 0)
-		{
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-					progname, subpath, strerror(errno));
-			continue;
-		}
-
-		if (S_ISREG(fst.st_mode))
-			(*action) (subpath, false);
-		else if (S_ISDIR(fst.st_mode))
-			walkdir(subpath, action, false);
-	}
-
-	if (errno)
-		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
-				progname, path, strerror(errno));
-
-	(void) closedir(dir);
-
-	/*
-	 * It's important to fsync the destination directory itself as individual
-	 * file fsyncs don't guarantee that the directory entry for the file is
-	 * synced.  Recent versions of ext4 have made the window much wider but
-	 * it's been an issue for ext3 and other filesystems in the past.
-	 */
-	(*action) (path, true);
-}
-
-/*
- * Hint to the OS that it should get ready to fsync() this file.
- *
- * Ignores errors trying to open unreadable files, and reports other errors
- * non-fatally.
- */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static void
-pre_sync_fname(const char *fname, bool isdir)
-{
-	int			fd;
-
-	fd = open(fname, O_RDONLY | PG_BINARY);
-
-	if (fd < 0)
-	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
-		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-		return;
-	}
-
-	/*
-	 * We do what pg_flush_data() would do in the backend: prefer to use
-	 * sync_file_range, but fall back to posix_fadvise.  We ignore errors
-	 * because this is only a hint.
-	 */
-#if defined(HAVE_SYNC_FILE_RANGE)
-	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
-#else
-#error PG_FLUSH_DATA_WORKS should not have been defined
-#endif
-
-	(void) close(fd);
-}
-
-#endif   /* PG_FLUSH_DATA_WORKS */
-
-/*
- * fsync_fname_ext -- Try to fsync a file or directory
- *
- * Ignores errors trying to open unreadable files, or trying to fsync
- * directories on systems where that isn't allowed/required.  Reports
- * other errors non-fatally.
- */
-static void
-fsync_fname_ext(const char *fname, bool isdir)
-{
-	int			fd;
-	int			flags;
-	int			returncode;
-
-	/*
-	 * Some OSs require directories to be opened read-only whereas other
-	 * systems don't allow us to fsync files opened read-only; so we need both
-	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
-	 * not writable by our userid, but we assume that's OK.
-	 */
-	flags = PG_BINARY;
-	if (!isdir)
-		flags |= O_RDWR;
-	else
-		flags |= O_RDONLY;
-
-	/*
-	 * Open the file, silently ignoring errors about unreadable files (or
-	 * unsupported operations, e.g. opening a directory under Windows), and
-	 * logging others.
-	 */
-	fd = open(fname, flags);
-	if (fd < 0)
-	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
-		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-		return;
-	}
-
-	returncode = fsync(fd);
-
-	/*
-	 * Some OSes don't allow us to fsync directories at all, so we can ignore
-	 * those errors. Anything else needs to be reported.
-	 */
-	if (returncode != 0 && !(isdir && errno == EBADF))
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-
-	(void) close(fd);
-}
-
-/*
  * Open a subcommand with suitable error messaging
  */
 static FILE *
@@ -2276,77 +2091,6 @@ make_postgres(FILE *cmdfd)
 		PG_CMD_PUTS(*line);
 }
 
-/*
- * Issue fsync recursively on PGDATA and all its contents.
- *
- * We fsync regular files and directories wherever they are, but we
- * follow symlinks only for pg_xlog and immediately under pg_tblspc.
- * Other symlinks are presumed to point at files we're not responsible
- * for fsyncing, and might not have privileges to write at all.
- *
- * Errors are reported but not considered fatal.
- */
-static void
-fsync_pgdata(void)
-{
-	bool		xlog_is_symlink;
-	char		pg_xlog[MAXPGPATH];
-	char		pg_tblspc[MAXPGPATH];
-
-	fputs(_("syncing data to disk ... "), stdout);
-	fflush(stdout);
-
-	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
-	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
-
-	/*
-	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
-	 * because the first walkdir below will ignore it.
-	 */
-	xlog_is_symlink = false;
-
-#ifndef WIN32
-	{
-		struct stat st;
-
-		if (lstat(pg_xlog, &st) < 0)
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-					progname, pg_xlog, strerror(errno));
-		else if (S_ISLNK(st.st_mode))
-			xlog_is_symlink = true;
-	}
-#else
-	if (pgwin32_is_junction(pg_xlog))
-		xlog_is_symlink = true;
-#endif
-
-	/*
-	 * If possible, hint to the kernel that we're soon going to fsync the data
-	 * directory and its contents.
-	 */
-#ifdef PG_FLUSH_DATA_WORKS
-	walkdir(pg_data, pre_sync_fname, false);
-	if (xlog_is_symlink)
-		walkdir(pg_xlog, pre_sync_fname, false);
-	walkdir(pg_tblspc, pre_sync_fname, true);
-#endif
-
-	/*
-	 * Now we do the fsync()s in the same order.
-	 *
-	 * The main call ignores symlinks, so in addition to specially processing
-	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
-	 * process_symlinks = true.  Note that if there are any plain directories
-	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
-	 * so we don't worry about optimizing it.
-	 */
-	walkdir(pg_data, fsync_fname_ext, false);
-	if (xlog_is_symlink)
-		walkdir(pg_xlog, fsync_fname_ext, false);
-	walkdir(pg_tblspc, fsync_fname_ext, true);
-
-	check_ok();
-}
 
 
 /*
@@ -3512,7 +3256,8 @@ main(int argc, char *argv[])
 			exit_nicely();
 		}
 
-		fsync_pgdata();
+		fsync_pgdata(pg_data, progname);
+		check_ok();
 		return 0;
 	}
 
@@ -3574,7 +3319,10 @@ main(int argc, char *argv[])
 	initialize_data_directory();
 
 	if (do_sync)
-		fsync_pgdata();
+	{
+		fsync_pgdata(pg_data, progname);
+		check_ok();
+	}
 	else
 		printf(_("\nSync to disk skipped.\nThe data directory might become corrupt if the operating system crashes.\n"));
 
diff --git a/src/common/Makefile b/src/common/Makefile
index 72b7369..17e5316 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -40,7 +40,7 @@ OBJS_COMMON = config_info.o controldata_utils.o exec.o keywords.o \
 	pg_lzcompress.o pgfnames.o psprintf.o relpath.o rmtree.o \
 	string.o username.o wait_error.o
 
-OBJS_FRONTEND = $(OBJS_COMMON) fe_memutils.o restricted_token.o
+OBJS_FRONTEND = $(OBJS_COMMON) fe_memutils.o file_utils.o restricted_token.o
 
 OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o)
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
new file mode 100644
index 0000000..8b97dcb
--- /dev/null
+++ b/src/common/file_utils.c
@@ -0,0 +1,279 @@
+/*-------------------------------------------------------------------------
+ *
+ * File-processing utility routines.
+ *
+ * Assorted utility functions to work on files.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/common/file_utils.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_utils.h"
+
+
+/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
+#if defined(HAVE_SYNC_FILE_RANGE)
+#define PG_FLUSH_DATA_WORKS 1
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+#define PG_FLUSH_DATA_WORKS 1
+#endif
+
+#ifdef PG_FLUSH_DATA_WORKS
+static void pre_sync_fname(const char *fname, bool isdir,
+						   const char *progname);
+#endif
+static void walkdir(const char *path,
+	void (*action) (const char *fname, bool isdir, const char *progname),
+	bool process_symlinks, const char *progname);
+
+/*
+ * Issue fsync recursively on PGDATA and all its contents.
+ *
+ * We fsync regular files and directories wherever they are, but we
+ * follow symlinks only for pg_xlog and immediately under pg_tblspc.
+ * Other symlinks are presumed to point at files we're not responsible
+ * for fsyncing, and might not have privileges to write at all.
+ *
+ * Errors are reported but not considered fatal.
+ */
+void
+fsync_pgdata(const char *pg_data, const char *progname)
+{
+	bool		xlog_is_symlink;
+	char		pg_xlog[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
+
+	fputs(_("syncing data to disk ... "), stdout);
+	fflush(stdout);
+
+	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
+	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
+
+	/*
+	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
+	 * because the first walkdir below will ignore it.
+	 */
+	xlog_is_symlink = false;
+
+#ifndef WIN32
+	{
+		struct stat st;
+
+		if (lstat(pg_xlog, &st) < 0)
+			fprintf(stderr, _("could not stat file \"%s\": %s\n"),
+					pg_xlog, strerror(errno));
+		else if (S_ISLNK(st.st_mode))
+			xlog_is_symlink = true;
+	}
+#else
+	if (pgwin32_is_junction(pg_xlog))
+		xlog_is_symlink = true;
+#endif
+
+	/*
+	 * If possible, hint to the kernel that we're soon going to fsync the data
+	 * directory and its contents.
+	 */
+#ifdef PG_FLUSH_DATA_WORKS
+	walkdir(pg_data, pre_sync_fname, false, progname);
+	if (xlog_is_symlink)
+		walkdir(pg_xlog, pre_sync_fname, false, progname);
+	walkdir(pg_tblspc, pre_sync_fname, true, progname);
+#endif
+
+	/*
+	 * Now we do the fsync()s in the same order.
+	 *
+	 * The main call ignores symlinks, so in addition to specially processing
+	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
+	 * process_symlinks = true.  Note that if there are any plain directories
+	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
+	 * so we don't worry about optimizing it.
+	 */
+	walkdir(pg_data, fsync_fname_ext, false, progname);
+	if (xlog_is_symlink)
+		walkdir(pg_xlog, fsync_fname_ext, false, progname);
+	walkdir(pg_tblspc, fsync_fname_ext, true, progname);
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself).
+ *
+ * If process_symlinks is true, the action and recursion are also applied
+ * to regular files and directories that are pointed to by symlinks in the
+ * given directory; otherwise symlinks are ignored.  Symlinks are always
+ * ignored in subdirectories, ie we intentionally don't pass down the
+ * process_symlinks flag to recursive calls.
+ *
+ * Errors are reported but not considered fatal.
+ *
+ * See also walkdir in fd.c, which is a backend version of this logic.
+ */
+static void
+walkdir(const char *path,
+		void (*action) (const char *fname, bool isdir, const char *progname),
+		bool process_symlinks, const char *progname)
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = opendir(path);
+	if (dir == NULL)
+	{
+		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
+				progname, path, strerror(errno));
+		return;
+	}
+
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+		int			sret;
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (process_symlinks)
+			sret = stat(subpath, &fst);
+		else
+			sret = lstat(subpath, &fst);
+
+		if (sret < 0)
+		{
+			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
+					progname, subpath, strerror(errno));
+			continue;
+		}
+
+		if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false, progname);
+		else if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action, false, progname);
+	}
+
+	if (errno)
+		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
+				progname, path, strerror(errno));
+
+	(void) closedir(dir);
+
+	/*
+	 * It's important to fsync the destination directory itself as individual
+	 * file fsyncs don't guarantee that the directory entry for the file is
+	 * synced.  Recent versions of ext4 have made the window much wider but
+	 * it's been an issue for ext3 and other filesystems in the past.
+	 */
+	(*action) (path, true, progname);
+}
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Ignores errors trying to open unreadable files, and reports other errors
+ * non-fatally.
+ */
+#ifdef PG_FLUSH_DATA_WORKS
+
+static void
+pre_sync_fname(const char *fname, bool isdir, const char *progname)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return;
+	}
+
+	/*
+	 * We do what pg_flush_data() would do in the backend: prefer to use
+	 * sync_file_range, but fall back to posix_fadvise.  We ignore errors
+	 * because this is only a hint.
+	 */
+#if defined(HAVE_SYNC_FILE_RANGE)
+	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+#else
+#error PG_FLUSH_DATA_WORKS should not have been defined
+#endif
+
+	(void) close(fd);
+}
+
+#endif   /* PG_FLUSH_DATA_WORKS */
+
+/*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Ignores errors trying to open unreadable files, or trying to fsync
+ * directories on systems where that isn't allowed/required.  Reports
+ * other errors non-fatally.
+ */
+void
+fsync_fname_ext(const char *fname, bool isdir, const char *progname)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+
+	(void) close(fd);
+}
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
new file mode 100644
index 0000000..9ff7c40
--- /dev/null
+++ b/src/include/common/file_utils.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * File-processing utility routines for frontend code
+ *
+ * Assorted utility functions to work on files.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/common/file_utils.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FILE_UTILS_H
+#define FILE_UTILS_H
+
+extern void fsync_fname_ext(const char *fname, bool isdir,
+							const char *progname);
+extern void fsync_pgdata(const char *pg_data, const char *progname);
+
+#endif   /* FILE_UTILS_H */
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 6746728..dad36b3 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -115,7 +115,7 @@ sub mkvcbuild
 	  string.c username.c wait_error.c);
 
 	our @pgcommonfrontendfiles = (
-		@pgcommonallfiles, qw(fe_memutils.c
+		@pgcommonallfiles, qw(file_utils.c fe_memutils.c
 		  restricted_token.c));
 
 	our @pgcommonbkndfiles = @pgcommonallfiles;
-- 
2.9.3

0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patchapplication/x-download; name=0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patchDownload
From 73f3e35b19475546f4f9202d326595a56eba3847 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 2 Sep 2016 15:36:08 +0900
Subject: [PATCH 2/2] Issue fsync more carefully in pg_receivexlog and
 pg_basebackup [-X] stream.

Several places weren't careful about fsyncing in the way. See 1d4a0ab1
and 606e0f98 for details about required fsyns.

This adds a couple of routines in fe_utils which have an equivalent in the
backend:
- durable_rename
- fsync_parent_path
---
 src/bin/pg_basebackup/pg_basebackup.c | 27 ++++++++++++
 src/bin/pg_basebackup/receivelog.c    | 56 ++++++++++++++----------
 src/common/file_utils.c               | 82 ++++++++++++++++++++++++++++++++---
 src/include/common/file_utils.h       |  5 ++-
 4 files changed, 141 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 351a420..a41b2e7 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -25,6 +25,7 @@
 #include <zlib.h>
 #endif
 
+#include "common/file_utils.h"
 #include "common/string.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
@@ -1115,6 +1116,10 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (copybuf != NULL)
 		PQfreemem(copybuf);
+
+	/* sync the resulting tar file, errors are not considered fatal */
+	if (strcmp(basedir, "-") != 0)
+		(void) fsync_fname_ext(filename, false, progname);
 }
 
 
@@ -1391,6 +1396,11 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (basetablespace && writerecoveryconf)
 		WriteRecoveryConf();
+
+	/*
+	 * No data is synced here, everything is done for all tablespaces at the
+	 * end.
+	 */
 }
 
 /*
@@ -1869,6 +1879,23 @@ BaseBackup(void)
 	PQclear(res);
 	PQfinish(conn);
 
+	/*
+	 * Make data persistent on disk once backup is completed. For tar
+	 * format once syncing the parent directory is fine, each tar file
+	 * created per tablespace has been already synced. In plain format,
+	 * all the data of the base directory is synced, taking into account
+	 * all the tablespaces. Errors are not considered fatal.
+	 */
+	if (format == 't')
+	{
+		if (strcmp(basedir, "-") != 0)
+			(void) fsync_fname_ext(basedir, true, progname);
+	}
+	else
+	{
+		(void) fsync_pgdata(basedir, progname);
+	}
+
 	if (verbose)
 		fprintf(stderr, "%s: base backup completed\n", progname);
 }
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 062730b..8634a91 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -23,6 +23,7 @@
 
 #include "libpq-fe.h"
 #include "access/xlog_internal.h"
+#include "common/file_utils.h"
 
 
 /* fd and filename for currently open WAL file */
@@ -68,17 +69,13 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
+	close(fd);
 
+	if (fsync_fname_ext(tmppath, false, progname) != 0)
 		return false;
-	}
 
-	close(fd);
+	if (fsync_parent_path(tmppath, progname) != 0)
+		return false;
 
 	return true;
 }
@@ -116,6 +113,10 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	/*
 	 * Verify that the file is either empty (just created), or a complete
 	 * XLogSegSize segment. Anything in between indicates a corrupt file.
+	 *
+	 * XXX: This means that we might not restart if a crash occurs before the
+	 * fsync below. We probably should create the file in a temporary path
+	 * like the backend does...
 	 */
 	if (fstat(f, &statbuf) != 0)
 	{
@@ -129,6 +130,16 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	{
 		/* File is open and ready to use */
 		walfile = f;
+
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname_ext(fn, false, progname) != 0)
+			return false;
+		if (fsync_parent_path(fn, progname) != 0)
+			return false;
+
 		return true;
 	}
 	if (statbuf.st_size != 0)
@@ -157,6 +168,17 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed. That's particularly important when
+	 * using synchronous mode, where the file is modified and fsynced
+	 * in-place, without a directory fsync.
+	 */
+	if (fsync_fname_ext(fn, false, progname) != 0)
+		return false;
+	if (fsync_parent_path(fn, progname) != 0)
+		return false;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -217,10 +239,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn, progname) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -338,14 +359,6 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		close(fd);
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-		return false;
-	}
-
 	if (close(fd) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
@@ -356,10 +369,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path, progname) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 8b97dcb..3a6e7db 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -34,7 +34,7 @@ static void pre_sync_fname(const char *fname, bool isdir,
 						   const char *progname);
 #endif
 static void walkdir(const char *path,
-	void (*action) (const char *fname, bool isdir, const char *progname),
+	int (*action) (const char *fname, bool isdir, const char *progname),
 	bool process_symlinks, const char *progname);
 
 /*
@@ -54,7 +54,7 @@ fsync_pgdata(const char *pg_data, const char *progname)
 	char		pg_xlog[MAXPGPATH];
 	char		pg_tblspc[MAXPGPATH];
 
-	fputs(_("syncing data to disk ... "), stdout);
+	fputs(_("syncing data to disk ...\n"), stdout);
 	fflush(stdout);
 
 	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
@@ -123,7 +123,7 @@ fsync_pgdata(const char *pg_data, const char *progname)
  */
 static void
 walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir, const char *progname),
+		int (*action) (const char *fname, bool isdir, const char *progname),
 		bool process_symlinks, const char *progname)
 {
 	DIR		   *dir;
@@ -231,7 +231,7 @@ pre_sync_fname(const char *fname, bool isdir, const char *progname)
  * directories on systems where that isn't allowed/required.  Reports
  * other errors non-fatally.
  */
-void
+int
 fsync_fname_ext(const char *fname, bool isdir, const char *progname)
 {
 	int			fd;
@@ -259,10 +259,10 @@ fsync_fname_ext(const char *fname, bool isdir, const char *progname)
 	if (fd < 0)
 	{
 		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
+			return 0;
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 				progname, fname, strerror(errno));
-		return;
+		return -1;
 	}
 
 	returncode = fsync(fd);
@@ -272,8 +272,78 @@ fsync_fname_ext(const char *fname, bool isdir, const char *progname)
 	 * those errors. Anything else needs to be reported.
 	 */
 	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
 		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
 				progname, fname, strerror(errno));
+		return -1;
+	}
 
 	(void) close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+int
+fsync_parent_path(const char *fname, const char *progname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname_ext(parentpath, true, progname) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.  Note that this
+ * version does not fsync the target file before the rename, as it's unlikely
+ * to be helpful for current and prospective users.
+ */
+int
+durable_rename(const char *oldfile, const char *newfile, const char *progname)
+{
+	/*
+	 * First fsync the old path, to ensure that it is properly persistent on
+	 * disk.
+	 */
+	if (fsync_fname_ext(oldfile, false, progname) != 0)
+		return -1;
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false, progname) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile, progname) != 0)
+		return -1;
+
+	return 0;
 }
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 9ff7c40..4e5d8b3 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -15,8 +15,11 @@
 #ifndef FILE_UTILS_H
 #define FILE_UTILS_H
 
-extern void fsync_fname_ext(const char *fname, bool isdir,
+extern int fsync_fname_ext(const char *fname, bool isdir,
 							const char *progname);
 extern void fsync_pgdata(const char *pg_data, const char *progname);
+extern int durable_rename(const char *oldfile, const char *newfile,
+						  const char *progname);
+extern int fsync_parent_path(const char *fname, const char *progname);
 
 #endif   /* FILE_UTILS_H */
-- 
2.9.3

#6Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#5)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 5/13/16 2:39 AM, Michael Paquier wrote:

So, attached are two patches that apply on HEAD to address the problem
of pg_basebackup that does not sync the data it writes. As
pg_basebackup cannot use directly initdb -S because, as a client-side
utility, it may be installed while initdb is not (see Fedora and
RHEL), I have refactored the code so as the routines in initdb.c doing
the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
this is 0001.

Why fe_utils? initdb is not a front-end program.

Thinking about that, you are right. Let's move it to src/common,
frontend-only though.

Patch 0002 is a set of fixes for pg_basebackup:
- In plain mode, fsync_pgdata is used so as all the tablespaces are
fsync'd at once. This takes care as well of the case where pg_xlog is
a symlink.
- In tar mode (no stdout), each tar file is synced individually, and
the base directory is synced once at the end.
In both cases, failures are not considered fatal.

Maybe there should be --nosync options like initdb has?

What do others think about that? I could implement that on top of 0002
with some extra options. But to be honest that looks to be just some
extra sugar for what is basically a bug fix... And I am feeling that
providing such a switch to users would be a way for one to shoot
himself badly, particularly for pg_receivexlog where a crash can cause
segments to go missing.

Well, why do we provide a --nosync option for initdb? Wouldn't the argument
basically be the same?

I agree it kind of feels like overkill, but it would be consistent
overkill? :)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#7Michael Paquier
michael.paquier@gmail.com
In reply to: Magnus Hagander (#6)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Sat, Sep 3, 2016 at 12:42 AM, Magnus Hagander <magnus@hagander.net> wrote:

On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 5/13/16 2:39 AM, Michael Paquier wrote:

What do others think about that? I could implement that on top of 0002
with some extra options. But to be honest that looks to be just some
extra sugar for what is basically a bug fix... And I am feeling that
providing such a switch to users would be a way for one to shoot
himself badly, particularly for pg_receivexlog where a crash can cause
segments to go missing.

Well, why do we provide a --nosync option for initdb? Wouldn't the argument
basically be the same?

Yes, the good-for-testing-but-not-production argument.

I agree it kind of feels like overkill, but it would be consistent overkill?
:)

Oh, well. I have just implemented it on top of the two other patches
for pg_basebackup. For pg_receivexlog, I am wondering if it makes
sense to have it. That would be trivial to implement it, and I think
that we had better make the combination of --synchronous and --nosync
just leave with an error. Thoughts about having that for
pg_receivexlog?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#7)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Sat, Sep 3, 2016 at 10:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Sat, Sep 3, 2016 at 12:42 AM, Magnus Hagander <magnus@hagander.net> wrote:

On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 5/13/16 2:39 AM, Michael Paquier wrote:

What do others think about that? I could implement that on top of 0002
with some extra options. But to be honest that looks to be just some
extra sugar for what is basically a bug fix... And I am feeling that
providing such a switch to users would be a way for one to shoot
himself badly, particularly for pg_receivexlog where a crash can cause
segments to go missing.

Well, why do we provide a --nosync option for initdb? Wouldn't the argument
basically be the same?

Yes, the good-for-testing-but-not-production argument.

I agree it kind of feels like overkill, but it would be consistent overkill?
:)

Oh, well. I have just implemented it on top of the two other patches
for pg_basebackup. For pg_receivexlog, I am wondering if it makes
sense to have it. That would be trivial to implement it, and I think
that we had better make the combination of --synchronous and --nosync
just leave with an error. Thoughts about having that for
pg_receivexlog?

With patches that's actually better..
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#8)
3 attachment(s)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Sat, Sep 3, 2016 at 10:22 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Sat, Sep 3, 2016 at 10:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Oh, well. I have just implemented it on top of the two other patches
for pg_basebackup. For pg_receivexlog, I am wondering if it makes
sense to have it. That would be trivial to implement it, and I think
that we had better make the combination of --synchronous and --nosync
just leave with an error. Thoughts about having that for
pg_receivexlog?

With patches that's actually better..

Meh, meh et meh.
--
Michael

Attachments:

0001-Relocation-fsync-routines-of-initdb-into-src-common.patchapplication/x-patch; name=0001-Relocation-fsync-routines-of-initdb-into-src-common.patchDownload
From be6888046cc7dcfde33c22294e8d94a9369ff6b5 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 2 Sep 2016 15:19:11 +0900
Subject: [PATCH 1/3] Relocation fsync routines of initdb into src/common

Those are aimed at being used by other utilities, pg_basebackup mainly,
and at some other degree by pg_dump and pg_receivexlog.
---
 src/bin/initdb/initdb.c         | 266 +-------------------------------------
 src/common/Makefile             |   2 +-
 src/common/file_utils.c         | 279 ++++++++++++++++++++++++++++++++++++++++
 src/include/common/file_utils.h |  22 ++++
 src/tools/msvc/Mkvcbuild.pm     |   2 +-
 5 files changed, 310 insertions(+), 261 deletions(-)
 create mode 100644 src/common/file_utils.c
 create mode 100644 src/include/common/file_utils.h

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9407492..0b74ff9 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -61,6 +61,7 @@
 #endif
 
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "common/restricted_token.h"
 #include "common/username.h"
 #include "mb/pg_wchar.h"
@@ -70,13 +71,6 @@
 #include "fe_utils/string_utils.h"
 
 
-/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
-#if defined(HAVE_SYNC_FILE_RANGE)
-#define PG_FLUSH_DATA_WORKS 1
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-#define PG_FLUSH_DATA_WORKS 1
-#endif
-
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
 
@@ -237,13 +231,6 @@ static char **filter_lines_with_token(char **lines, const char *token);
 #endif
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
-static void walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks);
-#ifdef PG_FLUSH_DATA_WORKS
-static void pre_sync_fname(const char *fname, bool isdir);
-#endif
-static void fsync_fname_ext(const char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
@@ -270,7 +257,6 @@ static void load_plpgsql(FILE *cmdfd);
 static void vacuum_db(FILE *cmdfd);
 static void make_template0(FILE *cmdfd);
 static void make_postgres(FILE *cmdfd);
-static void fsync_pgdata(void);
 static void trapsig(int signum);
 static void check_ok(void);
 static char *escape_quotes(const char *src);
@@ -529,177 +515,6 @@ writefile(char *path, char **lines)
 }
 
 /*
- * walkdir: recursively walk a directory, applying the action to each
- * regular file and directory (including the named directory itself).
- *
- * If process_symlinks is true, the action and recursion are also applied
- * to regular files and directories that are pointed to by symlinks in the
- * given directory; otherwise symlinks are ignored.  Symlinks are always
- * ignored in subdirectories, ie we intentionally don't pass down the
- * process_symlinks flag to recursive calls.
- *
- * Errors are reported but not considered fatal.
- *
- * See also walkdir in fd.c, which is a backend version of this logic.
- */
-static void
-walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
-{
-	DIR		   *dir;
-	struct dirent *de;
-
-	dir = opendir(path);
-	if (dir == NULL)
-	{
-		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
-				progname, path, strerror(errno));
-		return;
-	}
-
-	while (errno = 0, (de = readdir(dir)) != NULL)
-	{
-		char		subpath[MAXPGPATH];
-		struct stat fst;
-		int			sret;
-
-		if (strcmp(de->d_name, ".") == 0 ||
-			strcmp(de->d_name, "..") == 0)
-			continue;
-
-		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
-
-		if (process_symlinks)
-			sret = stat(subpath, &fst);
-		else
-			sret = lstat(subpath, &fst);
-
-		if (sret < 0)
-		{
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-					progname, subpath, strerror(errno));
-			continue;
-		}
-
-		if (S_ISREG(fst.st_mode))
-			(*action) (subpath, false);
-		else if (S_ISDIR(fst.st_mode))
-			walkdir(subpath, action, false);
-	}
-
-	if (errno)
-		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
-				progname, path, strerror(errno));
-
-	(void) closedir(dir);
-
-	/*
-	 * It's important to fsync the destination directory itself as individual
-	 * file fsyncs don't guarantee that the directory entry for the file is
-	 * synced.  Recent versions of ext4 have made the window much wider but
-	 * it's been an issue for ext3 and other filesystems in the past.
-	 */
-	(*action) (path, true);
-}
-
-/*
- * Hint to the OS that it should get ready to fsync() this file.
- *
- * Ignores errors trying to open unreadable files, and reports other errors
- * non-fatally.
- */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static void
-pre_sync_fname(const char *fname, bool isdir)
-{
-	int			fd;
-
-	fd = open(fname, O_RDONLY | PG_BINARY);
-
-	if (fd < 0)
-	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
-		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-		return;
-	}
-
-	/*
-	 * We do what pg_flush_data() would do in the backend: prefer to use
-	 * sync_file_range, but fall back to posix_fadvise.  We ignore errors
-	 * because this is only a hint.
-	 */
-#if defined(HAVE_SYNC_FILE_RANGE)
-	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
-#else
-#error PG_FLUSH_DATA_WORKS should not have been defined
-#endif
-
-	(void) close(fd);
-}
-
-#endif   /* PG_FLUSH_DATA_WORKS */
-
-/*
- * fsync_fname_ext -- Try to fsync a file or directory
- *
- * Ignores errors trying to open unreadable files, or trying to fsync
- * directories on systems where that isn't allowed/required.  Reports
- * other errors non-fatally.
- */
-static void
-fsync_fname_ext(const char *fname, bool isdir)
-{
-	int			fd;
-	int			flags;
-	int			returncode;
-
-	/*
-	 * Some OSs require directories to be opened read-only whereas other
-	 * systems don't allow us to fsync files opened read-only; so we need both
-	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
-	 * not writable by our userid, but we assume that's OK.
-	 */
-	flags = PG_BINARY;
-	if (!isdir)
-		flags |= O_RDWR;
-	else
-		flags |= O_RDONLY;
-
-	/*
-	 * Open the file, silently ignoring errors about unreadable files (or
-	 * unsupported operations, e.g. opening a directory under Windows), and
-	 * logging others.
-	 */
-	fd = open(fname, flags);
-	if (fd < 0)
-	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
-		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-		return;
-	}
-
-	returncode = fsync(fd);
-
-	/*
-	 * Some OSes don't allow us to fsync directories at all, so we can ignore
-	 * those errors. Anything else needs to be reported.
-	 */
-	if (returncode != 0 && !(isdir && errno == EBADF))
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-
-	(void) close(fd);
-}
-
-/*
  * Open a subcommand with suitable error messaging
  */
 static FILE *
@@ -2276,77 +2091,6 @@ make_postgres(FILE *cmdfd)
 		PG_CMD_PUTS(*line);
 }
 
-/*
- * Issue fsync recursively on PGDATA and all its contents.
- *
- * We fsync regular files and directories wherever they are, but we
- * follow symlinks only for pg_xlog and immediately under pg_tblspc.
- * Other symlinks are presumed to point at files we're not responsible
- * for fsyncing, and might not have privileges to write at all.
- *
- * Errors are reported but not considered fatal.
- */
-static void
-fsync_pgdata(void)
-{
-	bool		xlog_is_symlink;
-	char		pg_xlog[MAXPGPATH];
-	char		pg_tblspc[MAXPGPATH];
-
-	fputs(_("syncing data to disk ... "), stdout);
-	fflush(stdout);
-
-	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
-	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
-
-	/*
-	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
-	 * because the first walkdir below will ignore it.
-	 */
-	xlog_is_symlink = false;
-
-#ifndef WIN32
-	{
-		struct stat st;
-
-		if (lstat(pg_xlog, &st) < 0)
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-					progname, pg_xlog, strerror(errno));
-		else if (S_ISLNK(st.st_mode))
-			xlog_is_symlink = true;
-	}
-#else
-	if (pgwin32_is_junction(pg_xlog))
-		xlog_is_symlink = true;
-#endif
-
-	/*
-	 * If possible, hint to the kernel that we're soon going to fsync the data
-	 * directory and its contents.
-	 */
-#ifdef PG_FLUSH_DATA_WORKS
-	walkdir(pg_data, pre_sync_fname, false);
-	if (xlog_is_symlink)
-		walkdir(pg_xlog, pre_sync_fname, false);
-	walkdir(pg_tblspc, pre_sync_fname, true);
-#endif
-
-	/*
-	 * Now we do the fsync()s in the same order.
-	 *
-	 * The main call ignores symlinks, so in addition to specially processing
-	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
-	 * process_symlinks = true.  Note that if there are any plain directories
-	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
-	 * so we don't worry about optimizing it.
-	 */
-	walkdir(pg_data, fsync_fname_ext, false);
-	if (xlog_is_symlink)
-		walkdir(pg_xlog, fsync_fname_ext, false);
-	walkdir(pg_tblspc, fsync_fname_ext, true);
-
-	check_ok();
-}
 
 
 /*
@@ -3512,7 +3256,8 @@ main(int argc, char *argv[])
 			exit_nicely();
 		}
 
-		fsync_pgdata();
+		fsync_pgdata(pg_data, progname);
+		check_ok();
 		return 0;
 	}
 
@@ -3574,7 +3319,10 @@ main(int argc, char *argv[])
 	initialize_data_directory();
 
 	if (do_sync)
-		fsync_pgdata();
+	{
+		fsync_pgdata(pg_data, progname);
+		check_ok();
+	}
 	else
 		printf(_("\nSync to disk skipped.\nThe data directory might become corrupt if the operating system crashes.\n"));
 
diff --git a/src/common/Makefile b/src/common/Makefile
index a5fa649..03dfaa1 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -44,7 +44,7 @@ OBJS_COMMON = config_info.o controldata_utils.o exec.o ip.o keywords.o \
 	md5.o pg_lzcompress.o pgfnames.o psprintf.o relpath.o rmtree.o \
 	string.o username.o wait_error.o
 
-OBJS_FRONTEND = $(OBJS_COMMON) fe_memutils.o restricted_token.o
+OBJS_FRONTEND = $(OBJS_COMMON) fe_memutils.o file_utils.o restricted_token.o
 
 OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o)
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
new file mode 100644
index 0000000..8b97dcb
--- /dev/null
+++ b/src/common/file_utils.c
@@ -0,0 +1,279 @@
+/*-------------------------------------------------------------------------
+ *
+ * File-processing utility routines.
+ *
+ * Assorted utility functions to work on files.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/common/file_utils.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_utils.h"
+
+
+/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
+#if defined(HAVE_SYNC_FILE_RANGE)
+#define PG_FLUSH_DATA_WORKS 1
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+#define PG_FLUSH_DATA_WORKS 1
+#endif
+
+#ifdef PG_FLUSH_DATA_WORKS
+static void pre_sync_fname(const char *fname, bool isdir,
+						   const char *progname);
+#endif
+static void walkdir(const char *path,
+	void (*action) (const char *fname, bool isdir, const char *progname),
+	bool process_symlinks, const char *progname);
+
+/*
+ * Issue fsync recursively on PGDATA and all its contents.
+ *
+ * We fsync regular files and directories wherever they are, but we
+ * follow symlinks only for pg_xlog and immediately under pg_tblspc.
+ * Other symlinks are presumed to point at files we're not responsible
+ * for fsyncing, and might not have privileges to write at all.
+ *
+ * Errors are reported but not considered fatal.
+ */
+void
+fsync_pgdata(const char *pg_data, const char *progname)
+{
+	bool		xlog_is_symlink;
+	char		pg_xlog[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
+
+	fputs(_("syncing data to disk ... "), stdout);
+	fflush(stdout);
+
+	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
+	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
+
+	/*
+	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
+	 * because the first walkdir below will ignore it.
+	 */
+	xlog_is_symlink = false;
+
+#ifndef WIN32
+	{
+		struct stat st;
+
+		if (lstat(pg_xlog, &st) < 0)
+			fprintf(stderr, _("could not stat file \"%s\": %s\n"),
+					pg_xlog, strerror(errno));
+		else if (S_ISLNK(st.st_mode))
+			xlog_is_symlink = true;
+	}
+#else
+	if (pgwin32_is_junction(pg_xlog))
+		xlog_is_symlink = true;
+#endif
+
+	/*
+	 * If possible, hint to the kernel that we're soon going to fsync the data
+	 * directory and its contents.
+	 */
+#ifdef PG_FLUSH_DATA_WORKS
+	walkdir(pg_data, pre_sync_fname, false, progname);
+	if (xlog_is_symlink)
+		walkdir(pg_xlog, pre_sync_fname, false, progname);
+	walkdir(pg_tblspc, pre_sync_fname, true, progname);
+#endif
+
+	/*
+	 * Now we do the fsync()s in the same order.
+	 *
+	 * The main call ignores symlinks, so in addition to specially processing
+	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
+	 * process_symlinks = true.  Note that if there are any plain directories
+	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
+	 * so we don't worry about optimizing it.
+	 */
+	walkdir(pg_data, fsync_fname_ext, false, progname);
+	if (xlog_is_symlink)
+		walkdir(pg_xlog, fsync_fname_ext, false, progname);
+	walkdir(pg_tblspc, fsync_fname_ext, true, progname);
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself).
+ *
+ * If process_symlinks is true, the action and recursion are also applied
+ * to regular files and directories that are pointed to by symlinks in the
+ * given directory; otherwise symlinks are ignored.  Symlinks are always
+ * ignored in subdirectories, ie we intentionally don't pass down the
+ * process_symlinks flag to recursive calls.
+ *
+ * Errors are reported but not considered fatal.
+ *
+ * See also walkdir in fd.c, which is a backend version of this logic.
+ */
+static void
+walkdir(const char *path,
+		void (*action) (const char *fname, bool isdir, const char *progname),
+		bool process_symlinks, const char *progname)
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = opendir(path);
+	if (dir == NULL)
+	{
+		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
+				progname, path, strerror(errno));
+		return;
+	}
+
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+		int			sret;
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (process_symlinks)
+			sret = stat(subpath, &fst);
+		else
+			sret = lstat(subpath, &fst);
+
+		if (sret < 0)
+		{
+			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
+					progname, subpath, strerror(errno));
+			continue;
+		}
+
+		if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false, progname);
+		else if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action, false, progname);
+	}
+
+	if (errno)
+		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
+				progname, path, strerror(errno));
+
+	(void) closedir(dir);
+
+	/*
+	 * It's important to fsync the destination directory itself as individual
+	 * file fsyncs don't guarantee that the directory entry for the file is
+	 * synced.  Recent versions of ext4 have made the window much wider but
+	 * it's been an issue for ext3 and other filesystems in the past.
+	 */
+	(*action) (path, true, progname);
+}
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Ignores errors trying to open unreadable files, and reports other errors
+ * non-fatally.
+ */
+#ifdef PG_FLUSH_DATA_WORKS
+
+static void
+pre_sync_fname(const char *fname, bool isdir, const char *progname)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return;
+	}
+
+	/*
+	 * We do what pg_flush_data() would do in the backend: prefer to use
+	 * sync_file_range, but fall back to posix_fadvise.  We ignore errors
+	 * because this is only a hint.
+	 */
+#if defined(HAVE_SYNC_FILE_RANGE)
+	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+#else
+#error PG_FLUSH_DATA_WORKS should not have been defined
+#endif
+
+	(void) close(fd);
+}
+
+#endif   /* PG_FLUSH_DATA_WORKS */
+
+/*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Ignores errors trying to open unreadable files, or trying to fsync
+ * directories on systems where that isn't allowed/required.  Reports
+ * other errors non-fatally.
+ */
+void
+fsync_fname_ext(const char *fname, bool isdir, const char *progname)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+
+	(void) close(fd);
+}
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
new file mode 100644
index 0000000..9ff7c40
--- /dev/null
+++ b/src/include/common/file_utils.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * File-processing utility routines for frontend code
+ *
+ * Assorted utility functions to work on files.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/common/file_utils.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FILE_UTILS_H
+#define FILE_UTILS_H
+
+extern void fsync_fname_ext(const char *fname, bool isdir,
+							const char *progname);
+extern void fsync_pgdata(const char *pg_data, const char *progname);
+
+#endif   /* FILE_UTILS_H */
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index b3ed1f5..f9edc22 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -115,7 +115,7 @@ sub mkvcbuild
 	  string.c username.c wait_error.c);
 
 	our @pgcommonfrontendfiles = (
-		@pgcommonallfiles, qw(fe_memutils.c
+		@pgcommonallfiles, qw(file_utils.c fe_memutils.c
 		  restricted_token.c));
 
 	our @pgcommonbkndfiles = @pgcommonallfiles;
-- 
2.9.3

0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patchapplication/x-patch; name=0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patchDownload
From 424dce3f8af832bd164456e33126eb09326e9bfa Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 2 Sep 2016 15:36:08 +0900
Subject: [PATCH 2/3] Issue fsync more carefully in pg_receivexlog and
 pg_basebackup [-X] stream.

Several places weren't careful about fsyncing in the way. See 1d4a0ab1
and 606e0f98 for details about required fsyns.

This adds a couple of routines in fe_utils which have an equivalent in the
backend:
- durable_rename
- fsync_parent_path
---
 src/bin/pg_basebackup/pg_basebackup.c | 27 ++++++++++++
 src/bin/pg_basebackup/receivelog.c    | 56 ++++++++++++++----------
 src/common/file_utils.c               | 82 ++++++++++++++++++++++++++++++++---
 src/include/common/file_utils.h       |  5 ++-
 4 files changed, 141 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 351a420..a41b2e7 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -25,6 +25,7 @@
 #include <zlib.h>
 #endif
 
+#include "common/file_utils.h"
 #include "common/string.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
@@ -1115,6 +1116,10 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (copybuf != NULL)
 		PQfreemem(copybuf);
+
+	/* sync the resulting tar file, errors are not considered fatal */
+	if (strcmp(basedir, "-") != 0)
+		(void) fsync_fname_ext(filename, false, progname);
 }
 
 
@@ -1391,6 +1396,11 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (basetablespace && writerecoveryconf)
 		WriteRecoveryConf();
+
+	/*
+	 * No data is synced here, everything is done for all tablespaces at the
+	 * end.
+	 */
 }
 
 /*
@@ -1869,6 +1879,23 @@ BaseBackup(void)
 	PQclear(res);
 	PQfinish(conn);
 
+	/*
+	 * Make data persistent on disk once backup is completed. For tar
+	 * format once syncing the parent directory is fine, each tar file
+	 * created per tablespace has been already synced. In plain format,
+	 * all the data of the base directory is synced, taking into account
+	 * all the tablespaces. Errors are not considered fatal.
+	 */
+	if (format == 't')
+	{
+		if (strcmp(basedir, "-") != 0)
+			(void) fsync_fname_ext(basedir, true, progname);
+	}
+	else
+	{
+		(void) fsync_pgdata(basedir, progname);
+	}
+
 	if (verbose)
 		fprintf(stderr, "%s: base backup completed\n", progname);
 }
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 062730b..8634a91 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -23,6 +23,7 @@
 
 #include "libpq-fe.h"
 #include "access/xlog_internal.h"
+#include "common/file_utils.h"
 
 
 /* fd and filename for currently open WAL file */
@@ -68,17 +69,13 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
+	close(fd);
 
+	if (fsync_fname_ext(tmppath, false, progname) != 0)
 		return false;
-	}
 
-	close(fd);
+	if (fsync_parent_path(tmppath, progname) != 0)
+		return false;
 
 	return true;
 }
@@ -116,6 +113,10 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	/*
 	 * Verify that the file is either empty (just created), or a complete
 	 * XLogSegSize segment. Anything in between indicates a corrupt file.
+	 *
+	 * XXX: This means that we might not restart if a crash occurs before the
+	 * fsync below. We probably should create the file in a temporary path
+	 * like the backend does...
 	 */
 	if (fstat(f, &statbuf) != 0)
 	{
@@ -129,6 +130,16 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	{
 		/* File is open and ready to use */
 		walfile = f;
+
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname_ext(fn, false, progname) != 0)
+			return false;
+		if (fsync_parent_path(fn, progname) != 0)
+			return false;
+
 		return true;
 	}
 	if (statbuf.st_size != 0)
@@ -157,6 +168,17 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed. That's particularly important when
+	 * using synchronous mode, where the file is modified and fsynced
+	 * in-place, without a directory fsync.
+	 */
+	if (fsync_fname_ext(fn, false, progname) != 0)
+		return false;
+	if (fsync_parent_path(fn, progname) != 0)
+		return false;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -217,10 +239,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn, progname) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -338,14 +359,6 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		close(fd);
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-		return false;
-	}
-
 	if (close(fd) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
@@ -356,10 +369,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path, progname) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 8b97dcb..3a6e7db 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -34,7 +34,7 @@ static void pre_sync_fname(const char *fname, bool isdir,
 						   const char *progname);
 #endif
 static void walkdir(const char *path,
-	void (*action) (const char *fname, bool isdir, const char *progname),
+	int (*action) (const char *fname, bool isdir, const char *progname),
 	bool process_symlinks, const char *progname);
 
 /*
@@ -54,7 +54,7 @@ fsync_pgdata(const char *pg_data, const char *progname)
 	char		pg_xlog[MAXPGPATH];
 	char		pg_tblspc[MAXPGPATH];
 
-	fputs(_("syncing data to disk ... "), stdout);
+	fputs(_("syncing data to disk ...\n"), stdout);
 	fflush(stdout);
 
 	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
@@ -123,7 +123,7 @@ fsync_pgdata(const char *pg_data, const char *progname)
  */
 static void
 walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir, const char *progname),
+		int (*action) (const char *fname, bool isdir, const char *progname),
 		bool process_symlinks, const char *progname)
 {
 	DIR		   *dir;
@@ -231,7 +231,7 @@ pre_sync_fname(const char *fname, bool isdir, const char *progname)
  * directories on systems where that isn't allowed/required.  Reports
  * other errors non-fatally.
  */
-void
+int
 fsync_fname_ext(const char *fname, bool isdir, const char *progname)
 {
 	int			fd;
@@ -259,10 +259,10 @@ fsync_fname_ext(const char *fname, bool isdir, const char *progname)
 	if (fd < 0)
 	{
 		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
+			return 0;
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 				progname, fname, strerror(errno));
-		return;
+		return -1;
 	}
 
 	returncode = fsync(fd);
@@ -272,8 +272,78 @@ fsync_fname_ext(const char *fname, bool isdir, const char *progname)
 	 * those errors. Anything else needs to be reported.
 	 */
 	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
 		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
 				progname, fname, strerror(errno));
+		return -1;
+	}
 
 	(void) close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+int
+fsync_parent_path(const char *fname, const char *progname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname_ext(parentpath, true, progname) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.  Note that this
+ * version does not fsync the target file before the rename, as it's unlikely
+ * to be helpful for current and prospective users.
+ */
+int
+durable_rename(const char *oldfile, const char *newfile, const char *progname)
+{
+	/*
+	 * First fsync the old path, to ensure that it is properly persistent on
+	 * disk.
+	 */
+	if (fsync_fname_ext(oldfile, false, progname) != 0)
+		return -1;
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false, progname) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile, progname) != 0)
+		return -1;
+
+	return 0;
 }
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 9ff7c40..4e5d8b3 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -15,8 +15,11 @@
 #ifndef FILE_UTILS_H
 #define FILE_UTILS_H
 
-extern void fsync_fname_ext(const char *fname, bool isdir,
+extern int fsync_fname_ext(const char *fname, bool isdir,
 							const char *progname);
 extern void fsync_pgdata(const char *pg_data, const char *progname);
+extern int durable_rename(const char *oldfile, const char *newfile,
+						  const char *progname);
+extern int fsync_parent_path(const char *fname, const char *progname);
 
 #endif   /* FILE_UTILS_H */
-- 
2.9.3

0003-Add-nosync-option-to-pg_basebackup.patchapplication/x-patch; name=0003-Add-nosync-option-to-pg_basebackup.patchDownload
From 96ea7b7bf5484940258c8caef38a1301c1f0fd28 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Sat, 3 Sep 2016 22:18:22 +0900
Subject: [PATCH 3/3] Add --nosync option to pg_basebackup

This makes pg_basebackup, and is useful for testing.
---
 doc/src/sgml/ref/pg_basebackup.sgml    | 15 +++++++++++++++
 src/bin/pg_basebackup/pg_basebackup.c  | 28 +++++++++++++++++++---------
 src/bin/pg_basebackup/pg_receivexlog.c |  1 +
 src/bin/pg_basebackup/receivelog.c     | 20 +++++++++++---------
 src/bin/pg_basebackup/receivelog.h     |  4 +++-
 5 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 03615da..50195fd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -421,6 +421,21 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-N</option></term>
+      <term><option>--nosync</option></term>
+      <listitem>
+       <para>
+        By default, <command>pg_basebackup</command> will wait for all files
+        to be written safely to disk.  This option causes
+        <command>pg_basebackup</command> to return without waiting, which is
+        faster, but means that a subsequent operating system crash can leave
+        the base backup corrupt.  Generally, this option is useful for testing
+        but should not be used when creating a production installation.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-v</option></term>
       <term><option>--verbose</option></term>
       <listitem>
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index a41b2e7..b3d891c 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -66,6 +66,7 @@ static bool includewal = false;
 static bool streamwal = false;
 static bool fastcheckpoint = false;
 static bool writerecoveryconf = false;
+static bool do_sync = true;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static pg_time_t last_progress_report = 0;
 static int32 maxrate = 0;		/* no limit by default */
@@ -254,6 +255,7 @@ usage(void)
 	printf(_("  -c, --checkpoint=fast|spread\n"
 			 "                         set fast or spread checkpointing\n"));
 	printf(_("  -l, --label=LABEL      set backup label\n"));
+	printf(_("  -N, --nosync           do not wait for changes to be written safely to disk\n"));
 	printf(_("  -P, --progress         show progress information\n"));
 	printf(_("  -v, --verbose          output verbose messages\n"));
 	printf(_("  -V, --version          output version information, then exit\n"));
@@ -383,6 +385,7 @@ LogStreamerMain(logstreamer_param *param)
 	stream.stream_stop = reached_end_position;
 	stream.standby_message_timeout = standby_message_timeout;
 	stream.synchronous = false;
+	stream.do_sync = do_sync;
 	stream.mark_done = true;
 	stream.basedir = param->xlogdir;
 	stream.partial_suffix = NULL;
@@ -1118,7 +1121,7 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 		PQfreemem(copybuf);
 
 	/* sync the resulting tar file, errors are not considered fatal */
-	if (strcmp(basedir, "-") != 0)
+	if (do_sync && strcmp(basedir, "-") != 0)
 		(void) fsync_fname_ext(filename, false, progname);
 }
 
@@ -1886,14 +1889,17 @@ BaseBackup(void)
 	 * all the data of the base directory is synced, taking into account
 	 * all the tablespaces. Errors are not considered fatal.
 	 */
-	if (format == 't')
+	if (do_sync)
 	{
-		if (strcmp(basedir, "-") != 0)
-			(void) fsync_fname_ext(basedir, true, progname);
-	}
-	else
-	{
-		(void) fsync_pgdata(basedir, progname);
+		if (format == 't')
+		{
+			if (strcmp(basedir, "-") != 0)
+				(void) fsync_fname_ext(basedir, true, progname);
+		}
+		else
+		{
+			(void) fsync_pgdata(basedir, progname);
+		}
 	}
 
 	if (verbose)
@@ -1928,6 +1934,7 @@ main(int argc, char **argv)
 		{"status-interval", required_argument, NULL, 's'},
 		{"verbose", no_argument, NULL, 'v'},
 		{"progress", no_argument, NULL, 'P'},
+		{"nosync", no_argument, NULL, 'N'},
 		{"xlogdir", required_argument, NULL, 1},
 		{NULL, 0, NULL, 0}
 	};
@@ -1953,7 +1960,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "D:F:r:RT:xX:l:zZ:d:c:h:p:U:s:S:wWvP",
+	while ((c = getopt_long(argc, argv, "D:F:r:RT:xX:l:NzZ:d:c:h:p:U:s:S:wWvP",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -1977,6 +1984,9 @@ main(int argc, char **argv)
 			case 'r':
 				maxrate = parse_max_rate(optarg);
 				break;
+			case 'N':
+				do_sync = false;
+				break;
 			case 'R':
 				writerecoveryconf = true;
 				break;
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index 7f7ee9d..a58a251 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -336,6 +336,7 @@ StreamLog(void)
 	stream.stream_stop = stop_streaming;
 	stream.standby_message_timeout = standby_message_timeout;
 	stream.synchronous = synchronous;
+	stream.do_sync = true;
 	stream.mark_done = false;
 	stream.basedir = basedir;
 	stream.partial_suffix = ".partial";
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8634a91..f7a4397 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -53,7 +53,7 @@ static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
 
 static bool
-mark_file_as_archived(const char *basedir, const char *fname)
+mark_file_as_archived(const char *basedir, const char *fname, bool do_sync)
 {
 	int			fd;
 	static char tmppath[MAXPGPATH];
@@ -71,10 +71,10 @@ mark_file_as_archived(const char *basedir, const char *fname)
 
 	close(fd);
 
-	if (fsync_fname_ext(tmppath, false, progname) != 0)
+	if (do_sync && fsync_fname_ext(tmppath, false, progname) != 0)
 		return false;
 
-	if (fsync_parent_path(tmppath, progname) != 0)
+	if (do_sync && fsync_parent_path(tmppath, progname) != 0)
 		return false;
 
 	return true;
@@ -135,9 +135,9 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 		 * fsync, in case of a previous crash between padding and fsyncing the
 		 * file.
 		 */
-		if (fsync_fname_ext(fn, false, progname) != 0)
+		if (stream->do_sync && fsync_fname_ext(fn, false, progname) != 0)
 			return false;
-		if (fsync_parent_path(fn, progname) != 0)
+		if (stream->do_sync && fsync_parent_path(fn, progname) != 0)
 			return false;
 
 		return true;
@@ -174,9 +174,9 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	 * using synchronous mode, where the file is modified and fsynced
 	 * in-place, without a directory fsync.
 	 */
-	if (fsync_fname_ext(fn, false, progname) != 0)
+	if (stream->do_sync && fsync_fname_ext(fn, false, progname) != 0)
 		return false;
-	if (fsync_parent_path(fn, progname) != 0)
+	if (stream->do_sync && fsync_parent_path(fn, progname) != 0)
 		return false;
 
 	if (lseek(f, SEEK_SET, 0) != 0)
@@ -259,7 +259,8 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 	if (currpos == XLOG_SEG_SIZE && stream->mark_done)
 	{
 		/* writes error message if failed */
-		if (!mark_file_as_archived(stream->basedir, current_walfile_name))
+		if (!mark_file_as_archived(stream->basedir, current_walfile_name,
+								   stream->do_sync))
 			return false;
 	}
 
@@ -379,7 +380,8 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	if (stream->mark_done)
 	{
 		/* writes error message if failed */
-		if (!mark_file_as_archived(stream->basedir, histfname))
+		if (!mark_file_as_archived(stream->basedir, histfname,
+								   stream->do_sync))
 			return false;
 	}
 
diff --git a/src/bin/pg_basebackup/receivelog.h b/src/bin/pg_basebackup/receivelog.h
index 554ff8b..7a3bbc5 100644
--- a/src/bin/pg_basebackup/receivelog.h
+++ b/src/bin/pg_basebackup/receivelog.h
@@ -34,8 +34,10 @@ typedef struct StreamCtl
 								 * timeline */
 	int			standby_message_timeout;		/* Send status messages this
 												 * often */
-	bool		synchronous;	/* Flush data on write */
+	bool		synchronous;	/* Flush immediately WAL data on write */
 	bool		mark_done;		/* Mark segment as done in generated archive */
+	bool		do_sync;		/* Flush to disk to ensure consistent state
+								 * of data */
 
 	stream_stop_callback stream_stop;	/* Stop streaming when returns true */
 
-- 
2.9.3

#10Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#7)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Sat, Sep 3, 2016 at 3:21 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Sat, Sep 3, 2016 at 12:42 AM, Magnus Hagander <magnus@hagander.net>
wrote:

On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <

michael.paquier@gmail.com>

wrote:

On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 5/13/16 2:39 AM, Michael Paquier wrote:

What do others think about that? I could implement that on top of 0002
with some extra options. But to be honest that looks to be just some
extra sugar for what is basically a bug fix... And I am feeling that
providing such a switch to users would be a way for one to shoot
himself badly, particularly for pg_receivexlog where a crash can cause
segments to go missing.

Well, why do we provide a --nosync option for initdb? Wouldn't the

argument

basically be the same?

Yes, the good-for-testing-but-not-production argument.

I agree it kind of feels like overkill, but it would be consistent

overkill?

:)

Oh, well. I have just implemented it on top of the two other patches
for pg_basebackup. For pg_receivexlog, I am wondering if it makes
sense to have it. That would be trivial to implement it, and I think
that we had better make the combination of --synchronous and --nosync
just leave with an error. Thoughts about having that for
pg_receivexlog?

Yes, we should definitely not allow that combination. In fact, maybe that
argument in itself is enough not to have it for pg_receivexlog -- the cause
of confusion.

I can see how you might want to avoid it for pg_basebackup during testing
for example,. but the overhead on pg_receivexlog shouldn't be as bad in
testing, should it?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#11Michael Paquier
michael.paquier@gmail.com
In reply to: Magnus Hagander (#10)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Sat, Sep 3, 2016 at 10:26 PM, Magnus Hagander <magnus@hagander.net> wrote:

Yes, we should definitely not allow that combination. In fact, maybe that
argument in itself is enough not to have it for pg_receivexlog -- the cause
of confusion.

I can see how you might want to avoid it for pg_basebackup during testing
for example,. but the overhead on pg_receivexlog shouldn't be as bad in
testing, should it?

No, I haven't tested directly but it should not be.. pg_basebackup
does always a bulk write, while pg_receivexlog depends on the server
activity.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#9)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On 9/3/16 9:23 AM, Michael Paquier wrote:

On Sat, Sep 3, 2016 at 10:22 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Sat, Sep 3, 2016 at 10:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Oh, well. I have just implemented it on top of the two other patches
for pg_basebackup. For pg_receivexlog, I am wondering if it makes
sense to have it. That would be trivial to implement it, and I think
that we had better make the combination of --synchronous and --nosync
just leave with an error. Thoughts about having that for
pg_receivexlog?

With patches that's actually better..

Meh, meh et meh.

In fsync_pgdata(), you have removed the progname from one error
message, even though it is passed into the function. Also, I think
fsync_pgdata() should not be printing initdb's progress messages.
That should stay in initdb. Worse, in 0002 you are then changing the
output, presumably to suit pg_basebackup or something else, which
messes up the initdb output.

durable_rename() does not fsync an existing newfile before renaming,
in contrast to the backend code in fd.c.

I'm slightly concerned about lots of code duplication in
durable_rename, fsync_parent_path between backend and standalone code.
Also, I'm equally slightly concerned that having duplicate function
names will mess up tag search for someone.

This is a preexisting mistake, but fsync_fname_ext() should just be
fsync_fname() to match the backend naming. In the backend,
fsync_fname_ext() "extends" fsync_fname() by accepting an ignore_perm
argument, but the initdb code doesn't do that.

I don't think tar file output in pg_basebackup needs to be fsynced.
It's just a backup file like what pg_dump produces, and we don't fsync
that either. The point of this change is to leave a data directory in
a synced state equivalent to what initdb leaves behind.

Some of the changes in receivelog.c seem excessive. Replacing a plain
fsync() with fsync_fname_ext() plus fsync_parent_path() isn't
justified by the references you have provided. This would need more
explanation.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Craig Ringer
craig.ringer@2ndquadrant.com
In reply to: Michael Paquier (#7)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On 3 Sep. 2016 9:22 pm, "Michael Paquier" <michael.paquier@gmail.com> wrote:

On Sat, Sep 3, 2016 at 12:42 AM, Magnus Hagander <magnus@hagander.net>

wrote:

On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <

michael.paquier@gmail.com>

wrote:

On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 5/13/16 2:39 AM, Michael Paquier wrote:

What do others think about that? I could implement that on top of 0002
with some extra options. But to be honest that looks to be just some
extra sugar for what is basically a bug fix... And I am feeling that
providing such a switch to users would be a way for one to shoot
himself badly, particularly for pg_receivexlog where a crash can cause
segments to go missing.

Well, why do we provide a --nosync option for initdb? Wouldn't the

argument

basically be the same?

Yes, the good-for-testing-but-not-production argument.

We need it for tap tests. More and more will use pg_basebackup and it'll
start hurting test speeds badly.

#14Michael Paquier
michael.paquier@gmail.com
In reply to: Craig Ringer (#13)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Sat, Sep 10, 2016 at 7:36 PM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

We need it for tap tests. More and more will use pg_basebackup and it'll
start hurting test speeds badly.

Ah yes, that's a good argument. hamster would suffer pretty badly
after that if nothing is done. I'll get an extra patch out for that,
with --no-sync and not --nosync by the way.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#14)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On 2016-09-13 10:35:38 +0900, Michael Paquier wrote:

On Sat, Sep 10, 2016 at 7:36 PM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

We need it for tap tests. More and more will use pg_basebackup and it'll
start hurting test speeds badly.

Ah yes, that's a good argument. hamster would suffer pretty badly
after that if nothing is done. I'll get an extra patch out for that,
with --no-sync and not --nosync by the way.

FWIW, it might be better to instead use eatmydata in the cron
invocations on such slow machines, that way we also test the fsync paths
in them.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#15)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Tue, Sep 13, 2016 at 10:37 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-09-13 10:35:38 +0900, Michael Paquier wrote:

On Sat, Sep 10, 2016 at 7:36 PM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

We need it for tap tests. More and more will use pg_basebackup and it'll
start hurting test speeds badly.

Ah yes, that's a good argument. hamster would suffer pretty badly
after that if nothing is done. I'll get an extra patch out for that,
with --no-sync and not --nosync by the way.

FWIW, it might be better to instead use eatmydata in the cron
invocations on such slow machines, that way we also test the fsync paths
in them.

Er, well.. libeatmydata is only available in AUR for Archlinux x86_64,
and not in sight for Arch ARM...
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Eisentraut (#12)
4 attachment(s)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Sat, Sep 10, 2016 at 6:27 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

In fsync_pgdata(), you have removed the progname from one error
message, even though it is passed into the function.

Right. Good catch.

Also, I think
fsync_pgdata() should not be printing initdb's progress messages.
That should stay in initdb.

That's a reference to that.
+   fputs(_("syncing data to disk ... "), stdout);
+   fflush(stdout);
Oops, fixed. I missed that now that I look at it. Perhaps I was
thinking something else when hacking that lastly, like getting this
message out for pg_basebackup as well.

Worse, in 0002 you are then changing the
output, presumably to suit pg_basebackup or something else, which
messes up the initdb output.

Yes, fixed.

durable_rename() does not fsync an existing newfile before renaming,
in contrast to the backend code in fd.c.

That's in 0002. In the case of initdb it did not really matter, but
that matters for pg_receivexlog, so let's do it unconditionally. I
reworked the code to check if the existing new file is here, and
fsync() if it exists.

I'm slightly concerned about lots of code duplication in
durable_rename, fsync_parent_path between backend and standalone code.

Based on the ground of code readability, I am not sure that it would
be worth sharing the code of durable_rename or even fsync_parent_path
between the backend and the frontend, particularly because they are
actually not really duplicated... One good step in favor of that would
be to get a elog()/ereport() layer in src/common as well, but that's
really something that this set of patches should tackle, no?

Also, I'm equally slightly concerned that having duplicate function
names will mess up tag search for someone.

There are similar cases, take palloc().. Now perhaps some of those
functions could be renamed pg_fsync_... Thoughts?

This is a preexisting mistake, but fsync_fname_ext() should just be
fsync_fname() to match the backend naming. In the backend,
fsync_fname_ext() "extends" fsync_fname() by accepting an ignore_perm
argument, but the initdb code doesn't do that.

OK.

I don't think tar file output in pg_basebackup needs to be fsynced.
It's just a backup file like what pg_dump produces, and we don't fsync
that either. The point of this change is to leave a data directory in
a synced state equivalent to what initdb leaves behind.

Here I don't agree. The point of the patch is to make sure that what
gets generated by pg_basebackup is consistent on disk, for all the
formats. It seems weird to me that we could say that the plain format
makes things consistent while the tar format may cause some files to
be lost in case of power outage. The docs make it clear I think:
+        By default, <command>pg_basebackup</command> will wait for all files
+        to be written safely to disk.
But perhaps this should directly mention that this is done for all the formats?

Some of the changes in receivelog.c seem excessive. Replacing a plain
fsync() with fsync_fname_ext() plus fsync_parent_path() isn't
justified by the references you have provided. This would need more
explanation.

In mark_file_as_archived(), we had better do it or in case of a crash
the .done file would just be gone. open_walfile() as well need that,
to ensure that the segment is created and zeroed, and in the case
where it was padded (is this one overthinking it?).

Patch 0003 had conflicts caused by 9083353.
--
Michael

Attachments:

0001-Relocation-fsync-routines-of-initdb-into-src-common.patchtext/x-diff; charset=US-ASCII; name=0001-Relocation-fsync-routines-of-initdb-into-src-common.patchDownload
From 6158dc8f70dd100f59d7d2c4b7a60435b7c1cfac Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 13 Sep 2016 12:14:12 +0900
Subject: [PATCH 1/4] Relocation fsync routines of initdb into src/common

Those are aimed at being used by other utilities, pg_basebackup mainly,
and at some other degree by pg_dump and pg_receivexlog.
---
 src/bin/initdb/initdb.c         | 270 ++-------------------------------------
 src/common/Makefile             |   2 +-
 src/common/file_utils.c         | 276 ++++++++++++++++++++++++++++++++++++++++
 src/include/common/file_utils.h |  22 ++++
 src/tools/msvc/Mkvcbuild.pm     |   2 +-
 5 files changed, 311 insertions(+), 261 deletions(-)
 create mode 100644 src/common/file_utils.c
 create mode 100644 src/include/common/file_utils.h

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3350e13..e52e67d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -61,6 +61,7 @@
 #endif
 
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "common/restricted_token.h"
 #include "common/username.h"
 #include "mb/pg_wchar.h"
@@ -70,13 +71,6 @@
 #include "fe_utils/string_utils.h"
 
 
-/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
-#if defined(HAVE_SYNC_FILE_RANGE)
-#define PG_FLUSH_DATA_WORKS 1
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-#define PG_FLUSH_DATA_WORKS 1
-#endif
-
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
 
@@ -237,13 +231,6 @@ static char **filter_lines_with_token(char **lines, const char *token);
 #endif
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
-static void walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks);
-#ifdef PG_FLUSH_DATA_WORKS
-static void pre_sync_fname(const char *fname, bool isdir);
-#endif
-static void fsync_fname_ext(const char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
@@ -270,7 +257,6 @@ static void load_plpgsql(FILE *cmdfd);
 static void vacuum_db(FILE *cmdfd);
 static void make_template0(FILE *cmdfd);
 static void make_postgres(FILE *cmdfd);
-static void fsync_pgdata(void);
 static void trapsig(int signum);
 static void check_ok(void);
 static char *escape_quotes(const char *src);
@@ -529,177 +515,6 @@ writefile(char *path, char **lines)
 }
 
 /*
- * walkdir: recursively walk a directory, applying the action to each
- * regular file and directory (including the named directory itself).
- *
- * If process_symlinks is true, the action and recursion are also applied
- * to regular files and directories that are pointed to by symlinks in the
- * given directory; otherwise symlinks are ignored.  Symlinks are always
- * ignored in subdirectories, ie we intentionally don't pass down the
- * process_symlinks flag to recursive calls.
- *
- * Errors are reported but not considered fatal.
- *
- * See also walkdir in fd.c, which is a backend version of this logic.
- */
-static void
-walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
-{
-	DIR		   *dir;
-	struct dirent *de;
-
-	dir = opendir(path);
-	if (dir == NULL)
-	{
-		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
-				progname, path, strerror(errno));
-		return;
-	}
-
-	while (errno = 0, (de = readdir(dir)) != NULL)
-	{
-		char		subpath[MAXPGPATH];
-		struct stat fst;
-		int			sret;
-
-		if (strcmp(de->d_name, ".") == 0 ||
-			strcmp(de->d_name, "..") == 0)
-			continue;
-
-		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
-
-		if (process_symlinks)
-			sret = stat(subpath, &fst);
-		else
-			sret = lstat(subpath, &fst);
-
-		if (sret < 0)
-		{
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-					progname, subpath, strerror(errno));
-			continue;
-		}
-
-		if (S_ISREG(fst.st_mode))
-			(*action) (subpath, false);
-		else if (S_ISDIR(fst.st_mode))
-			walkdir(subpath, action, false);
-	}
-
-	if (errno)
-		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
-				progname, path, strerror(errno));
-
-	(void) closedir(dir);
-
-	/*
-	 * It's important to fsync the destination directory itself as individual
-	 * file fsyncs don't guarantee that the directory entry for the file is
-	 * synced.  Recent versions of ext4 have made the window much wider but
-	 * it's been an issue for ext3 and other filesystems in the past.
-	 */
-	(*action) (path, true);
-}
-
-/*
- * Hint to the OS that it should get ready to fsync() this file.
- *
- * Ignores errors trying to open unreadable files, and reports other errors
- * non-fatally.
- */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static void
-pre_sync_fname(const char *fname, bool isdir)
-{
-	int			fd;
-
-	fd = open(fname, O_RDONLY | PG_BINARY);
-
-	if (fd < 0)
-	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
-		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-		return;
-	}
-
-	/*
-	 * We do what pg_flush_data() would do in the backend: prefer to use
-	 * sync_file_range, but fall back to posix_fadvise.  We ignore errors
-	 * because this is only a hint.
-	 */
-#if defined(HAVE_SYNC_FILE_RANGE)
-	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
-#else
-#error PG_FLUSH_DATA_WORKS should not have been defined
-#endif
-
-	(void) close(fd);
-}
-
-#endif   /* PG_FLUSH_DATA_WORKS */
-
-/*
- * fsync_fname_ext -- Try to fsync a file or directory
- *
- * Ignores errors trying to open unreadable files, or trying to fsync
- * directories on systems where that isn't allowed/required.  Reports
- * other errors non-fatally.
- */
-static void
-fsync_fname_ext(const char *fname, bool isdir)
-{
-	int			fd;
-	int			flags;
-	int			returncode;
-
-	/*
-	 * Some OSs require directories to be opened read-only whereas other
-	 * systems don't allow us to fsync files opened read-only; so we need both
-	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
-	 * not writable by our userid, but we assume that's OK.
-	 */
-	flags = PG_BINARY;
-	if (!isdir)
-		flags |= O_RDWR;
-	else
-		flags |= O_RDONLY;
-
-	/*
-	 * Open the file, silently ignoring errors about unreadable files (or
-	 * unsupported operations, e.g. opening a directory under Windows), and
-	 * logging others.
-	 */
-	fd = open(fname, flags);
-	if (fd < 0)
-	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
-		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-		return;
-	}
-
-	returncode = fsync(fd);
-
-	/*
-	 * Some OSes don't allow us to fsync directories at all, so we can ignore
-	 * those errors. Anything else needs to be reported.
-	 */
-	if (returncode != 0 && !(isdir && errno == EBADF))
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-
-	(void) close(fd);
-}
-
-/*
  * Open a subcommand with suitable error messaging
  */
 static FILE *
@@ -2276,77 +2091,6 @@ make_postgres(FILE *cmdfd)
 		PG_CMD_PUTS(*line);
 }
 
-/*
- * Issue fsync recursively on PGDATA and all its contents.
- *
- * We fsync regular files and directories wherever they are, but we
- * follow symlinks only for pg_xlog and immediately under pg_tblspc.
- * Other symlinks are presumed to point at files we're not responsible
- * for fsyncing, and might not have privileges to write at all.
- *
- * Errors are reported but not considered fatal.
- */
-static void
-fsync_pgdata(void)
-{
-	bool		xlog_is_symlink;
-	char		pg_xlog[MAXPGPATH];
-	char		pg_tblspc[MAXPGPATH];
-
-	fputs(_("syncing data to disk ... "), stdout);
-	fflush(stdout);
-
-	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
-	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
-
-	/*
-	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
-	 * because the first walkdir below will ignore it.
-	 */
-	xlog_is_symlink = false;
-
-#ifndef WIN32
-	{
-		struct stat st;
-
-		if (lstat(pg_xlog, &st) < 0)
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-					progname, pg_xlog, strerror(errno));
-		else if (S_ISLNK(st.st_mode))
-			xlog_is_symlink = true;
-	}
-#else
-	if (pgwin32_is_junction(pg_xlog))
-		xlog_is_symlink = true;
-#endif
-
-	/*
-	 * If possible, hint to the kernel that we're soon going to fsync the data
-	 * directory and its contents.
-	 */
-#ifdef PG_FLUSH_DATA_WORKS
-	walkdir(pg_data, pre_sync_fname, false);
-	if (xlog_is_symlink)
-		walkdir(pg_xlog, pre_sync_fname, false);
-	walkdir(pg_tblspc, pre_sync_fname, true);
-#endif
-
-	/*
-	 * Now we do the fsync()s in the same order.
-	 *
-	 * The main call ignores symlinks, so in addition to specially processing
-	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
-	 * process_symlinks = true.  Note that if there are any plain directories
-	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
-	 * so we don't worry about optimizing it.
-	 */
-	walkdir(pg_data, fsync_fname_ext, false);
-	if (xlog_is_symlink)
-		walkdir(pg_xlog, fsync_fname_ext, false);
-	walkdir(pg_tblspc, fsync_fname_ext, true);
-
-	check_ok();
-}
 
 
 /*
@@ -3512,7 +3256,10 @@ main(int argc, char *argv[])
 			exit_nicely();
 		}
 
-		fsync_pgdata();
+		fputs(_("syncing data to disk ... "), stdout);
+		fflush(stdout);
+		fsync_pgdata(pg_data, progname);
+		check_ok();
 		return 0;
 	}
 
@@ -3574,7 +3321,12 @@ main(int argc, char *argv[])
 	initialize_data_directory();
 
 	if (do_sync)
-		fsync_pgdata();
+	{
+		fputs(_("syncing data to disk ... "), stdout);
+		fflush(stdout);
+		fsync_pgdata(pg_data, progname);
+		check_ok();
+	}
 	else
 		printf(_("\nSync to disk skipped.\nThe data directory might become corrupt if the operating system crashes.\n"));
 
diff --git a/src/common/Makefile b/src/common/Makefile
index a5fa649..03dfaa1 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -44,7 +44,7 @@ OBJS_COMMON = config_info.o controldata_utils.o exec.o ip.o keywords.o \
 	md5.o pg_lzcompress.o pgfnames.o psprintf.o relpath.o rmtree.o \
 	string.o username.o wait_error.o
 
-OBJS_FRONTEND = $(OBJS_COMMON) fe_memutils.o restricted_token.o
+OBJS_FRONTEND = $(OBJS_COMMON) fe_memutils.o file_utils.o restricted_token.o
 
 OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o)
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
new file mode 100644
index 0000000..b6f62f7
--- /dev/null
+++ b/src/common/file_utils.c
@@ -0,0 +1,276 @@
+/*-------------------------------------------------------------------------
+ *
+ * File-processing utility routines.
+ *
+ * Assorted utility functions to work on files.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/common/file_utils.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_utils.h"
+
+
+/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
+#if defined(HAVE_SYNC_FILE_RANGE)
+#define PG_FLUSH_DATA_WORKS 1
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+#define PG_FLUSH_DATA_WORKS 1
+#endif
+
+#ifdef PG_FLUSH_DATA_WORKS
+static void pre_sync_fname(const char *fname, bool isdir,
+						   const char *progname);
+#endif
+static void walkdir(const char *path,
+	void (*action) (const char *fname, bool isdir, const char *progname),
+	bool process_symlinks, const char *progname);
+
+/*
+ * Issue fsync recursively on PGDATA and all its contents.
+ *
+ * We fsync regular files and directories wherever they are, but we
+ * follow symlinks only for pg_xlog and immediately under pg_tblspc.
+ * Other symlinks are presumed to point at files we're not responsible
+ * for fsyncing, and might not have privileges to write at all.
+ *
+ * Errors are reported but not considered fatal.
+ */
+void
+fsync_pgdata(const char *pg_data, const char *progname)
+{
+	bool		xlog_is_symlink;
+	char		pg_xlog[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
+
+	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
+	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
+
+	/*
+	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
+	 * because the first walkdir below will ignore it.
+	 */
+	xlog_is_symlink = false;
+
+#ifndef WIN32
+	{
+		struct stat st;
+
+		if (lstat(pg_xlog, &st) < 0)
+			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
+					progname, pg_xlog, strerror(errno));
+		else if (S_ISLNK(st.st_mode))
+			xlog_is_symlink = true;
+	}
+#else
+	if (pgwin32_is_junction(pg_xlog))
+		xlog_is_symlink = true;
+#endif
+
+	/*
+	 * If possible, hint to the kernel that we're soon going to fsync the data
+	 * directory and its contents.
+	 */
+#ifdef PG_FLUSH_DATA_WORKS
+	walkdir(pg_data, pre_sync_fname, false, progname);
+	if (xlog_is_symlink)
+		walkdir(pg_xlog, pre_sync_fname, false, progname);
+	walkdir(pg_tblspc, pre_sync_fname, true, progname);
+#endif
+
+	/*
+	 * Now we do the fsync()s in the same order.
+	 *
+	 * The main call ignores symlinks, so in addition to specially processing
+	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
+	 * process_symlinks = true.  Note that if there are any plain directories
+	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
+	 * so we don't worry about optimizing it.
+	 */
+	walkdir(pg_data, fsync_fname, false, progname);
+	if (xlog_is_symlink)
+		walkdir(pg_xlog, fsync_fname, false, progname);
+	walkdir(pg_tblspc, fsync_fname, true, progname);
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself).
+ *
+ * If process_symlinks is true, the action and recursion are also applied
+ * to regular files and directories that are pointed to by symlinks in the
+ * given directory; otherwise symlinks are ignored.  Symlinks are always
+ * ignored in subdirectories, ie we intentionally don't pass down the
+ * process_symlinks flag to recursive calls.
+ *
+ * Errors are reported but not considered fatal.
+ *
+ * See also walkdir in fd.c, which is a backend version of this logic.
+ */
+static void
+walkdir(const char *path,
+		void (*action) (const char *fname, bool isdir, const char *progname),
+		bool process_symlinks, const char *progname)
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = opendir(path);
+	if (dir == NULL)
+	{
+		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
+				progname, path, strerror(errno));
+		return;
+	}
+
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+		int			sret;
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (process_symlinks)
+			sret = stat(subpath, &fst);
+		else
+			sret = lstat(subpath, &fst);
+
+		if (sret < 0)
+		{
+			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
+					progname, subpath, strerror(errno));
+			continue;
+		}
+
+		if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false, progname);
+		else if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action, false, progname);
+	}
+
+	if (errno)
+		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
+				progname, path, strerror(errno));
+
+	(void) closedir(dir);
+
+	/*
+	 * It's important to fsync the destination directory itself as individual
+	 * file fsyncs don't guarantee that the directory entry for the file is
+	 * synced.  Recent versions of ext4 have made the window much wider but
+	 * it's been an issue for ext3 and other filesystems in the past.
+	 */
+	(*action) (path, true, progname);
+}
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Ignores errors trying to open unreadable files, and reports other errors
+ * non-fatally.
+ */
+#ifdef PG_FLUSH_DATA_WORKS
+
+static void
+pre_sync_fname(const char *fname, bool isdir, const char *progname)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return;
+	}
+
+	/*
+	 * We do what pg_flush_data() would do in the backend: prefer to use
+	 * sync_file_range, but fall back to posix_fadvise.  We ignore errors
+	 * because this is only a hint.
+	 */
+#if defined(HAVE_SYNC_FILE_RANGE)
+	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+#else
+#error PG_FLUSH_DATA_WORKS should not have been defined
+#endif
+
+	(void) close(fd);
+}
+
+#endif   /* PG_FLUSH_DATA_WORKS */
+
+/*
+ * fsync_fname -- Try to fsync a file or directory
+ *
+ * Ignores errors trying to open unreadable files, or trying to fsync
+ * directories on systems where that isn't allowed/required.  Reports
+ * other errors non-fatally.
+ */
+void
+fsync_fname(const char *fname, bool isdir, const char *progname)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+
+	(void) close(fd);
+}
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
new file mode 100644
index 0000000..d3794df
--- /dev/null
+++ b/src/include/common/file_utils.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * File-processing utility routines for frontend code
+ *
+ * Assorted utility functions to work on files.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/common/file_utils.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FILE_UTILS_H
+#define FILE_UTILS_H
+
+extern void fsync_fname(const char *fname, bool isdir,
+						const char *progname);
+extern void fsync_pgdata(const char *pg_data, const char *progname);
+
+#endif   /* FILE_UTILS_H */
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 93dfd24..3b925b6 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -115,7 +115,7 @@ sub mkvcbuild
 	  string.c username.c wait_error.c);
 
 	our @pgcommonfrontendfiles = (
-		@pgcommonallfiles, qw(fe_memutils.c
+		@pgcommonallfiles, qw(file_utils.c fe_memutils.c
 		  restricted_token.c));
 
 	our @pgcommonbkndfiles = @pgcommonallfiles;
-- 
2.10.0

0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patchtext/x-diff; charset=US-ASCII; name=0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patchDownload
From 4eb14fd2b6eaf7ea61932e2d9684a8b4dfc1e097 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 13 Sep 2016 13:00:51 +0900
Subject: [PATCH 2/4] Issue fsync more carefully in pg_receivexlog and
 pg_basebackup [-X] stream.

Several places weren't careful about fsyncing in the way. See 1d4a0ab1
and 606e0f98 for details about required fsyns.

This adds a couple of routines in fe_utils which have an equivalent in the
backend:
- durable_rename
- fsync_parent_path
---
 src/bin/pg_basebackup/pg_basebackup.c |  27 +++++++++
 src/bin/pg_basebackup/receivelog.c    |  56 ++++++++++-------
 src/common/file_utils.c               | 110 ++++++++++++++++++++++++++++++++--
 src/include/common/file_utils.h       |   7 ++-
 4 files changed, 171 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 42f3b27..2266e34 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -25,6 +25,7 @@
 #include <zlib.h>
 #endif
 
+#include "common/file_utils.h"
 #include "common/string.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
@@ -1194,6 +1195,10 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (copybuf != NULL)
 		PQfreemem(copybuf);
+
+	/* sync the resulting tar file, errors are not considered fatal */
+	if (strcmp(basedir, "-") != 0)
+		(void) fsync_fname(filename, false, progname);
 }
 
 
@@ -1470,6 +1475,11 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (basetablespace && writerecoveryconf)
 		WriteRecoveryConf();
+
+	/*
+	 * No data is synced here, everything is done for all tablespaces at the
+	 * end.
+	 */
 }
 
 /*
@@ -1948,6 +1958,23 @@ BaseBackup(void)
 	PQclear(res);
 	PQfinish(conn);
 
+	/*
+	 * Make data persistent on disk once backup is completed. For tar
+	 * format once syncing the parent directory is fine, each tar file
+	 * created per tablespace has been already synced. In plain format,
+	 * all the data of the base directory is synced, taking into account
+	 * all the tablespaces. Errors are not considered fatal.
+	 */
+	if (format == 't')
+	{
+		if (strcmp(basedir, "-") != 0)
+			(void) fsync_fname(basedir, true, progname);
+	}
+	else
+	{
+		(void) fsync_pgdata(basedir, progname);
+	}
+
 	if (verbose)
 		fprintf(stderr, "%s: base backup completed\n", progname);
 }
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 062730b..0a740ce 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -23,6 +23,7 @@
 
 #include "libpq-fe.h"
 #include "access/xlog_internal.h"
+#include "common/file_utils.h"
 
 
 /* fd and filename for currently open WAL file */
@@ -68,17 +69,13 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
+	close(fd);
 
+	if (fsync_fname(tmppath, false, progname) != 0)
 		return false;
-	}
 
-	close(fd);
+	if (fsync_parent_path(tmppath, progname) != 0)
+		return false;
 
 	return true;
 }
@@ -116,6 +113,10 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	/*
 	 * Verify that the file is either empty (just created), or a complete
 	 * XLogSegSize segment. Anything in between indicates a corrupt file.
+	 *
+	 * XXX: This means that we might not restart if a crash occurs before the
+	 * fsync below. We probably should create the file in a temporary path
+	 * like the backend does...
 	 */
 	if (fstat(f, &statbuf) != 0)
 	{
@@ -129,6 +130,16 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	{
 		/* File is open and ready to use */
 		walfile = f;
+
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname(fn, false, progname) != 0)
+			return false;
+		if (fsync_parent_path(fn, progname) != 0)
+			return false;
+
 		return true;
 	}
 	if (statbuf.st_size != 0)
@@ -157,6 +168,17 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed. That's particularly important when
+	 * using synchronous mode, where the file is modified and fsynced
+	 * in-place, without a directory fsync.
+	 */
+	if (fsync_fname(fn, false, progname) != 0)
+		return false;
+	if (fsync_parent_path(fn, progname) != 0)
+		return false;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -217,10 +239,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn, progname) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -338,14 +359,6 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		close(fd);
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-		return false;
-	}
-
 	if (close(fd) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
@@ -356,10 +369,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path, progname) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index b6f62f7..4e622cb 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -34,7 +34,7 @@ static void pre_sync_fname(const char *fname, bool isdir,
 						   const char *progname);
 #endif
 static void walkdir(const char *path,
-	void (*action) (const char *fname, bool isdir, const char *progname),
+	int (*action) (const char *fname, bool isdir, const char *progname),
 	bool process_symlinks, const char *progname);
 
 /*
@@ -120,7 +120,7 @@ fsync_pgdata(const char *pg_data, const char *progname)
  */
 static void
 walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir, const char *progname),
+		int (*action) (const char *fname, bool isdir, const char *progname),
 		bool process_symlinks, const char *progname)
 {
 	DIR		   *dir;
@@ -228,7 +228,7 @@ pre_sync_fname(const char *fname, bool isdir, const char *progname)
  * directories on systems where that isn't allowed/required.  Reports
  * other errors non-fatally.
  */
-void
+int
 fsync_fname(const char *fname, bool isdir, const char *progname)
 {
 	int			fd;
@@ -256,10 +256,10 @@ fsync_fname(const char *fname, bool isdir, const char *progname)
 	if (fd < 0)
 	{
 		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
+			return 0;
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 				progname, fname, strerror(errno));
-		return;
+		return -1;
 	}
 
 	returncode = fsync(fd);
@@ -269,8 +269,108 @@ fsync_fname(const char *fname, bool isdir, const char *progname)
 	 * those errors. Anything else needs to be reported.
 	 */
 	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
 		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
 				progname, fname, strerror(errno));
+		return -1;
+	}
 
 	(void) close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+int
+fsync_parent_path(const char *fname, const char *progname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname(parentpath, true, progname) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.
+ */
+int
+durable_rename(const char *oldfile, const char *newfile, const char *progname)
+{
+	int		fd;
+
+	/*
+	 * First fsync the old and target path (if it exists), to ensure that they
+	 * are properly persistent on disk. Syncing the target file is not
+	 * strictly necessary, but it makes it easier to reason about crashes;
+	 * because it's then guaranteed that either source or target file exists
+	 * after a crash.
+	 */
+	if (fsync_fname(oldfile, false, progname) != 0)
+		return -1;
+
+	fd = open(newfile, PG_BINARY | O_RDWR, 0);
+	if (fd < 0)
+	{
+		if (errno != ENOENT)
+		{
+			fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+					progname, newfile, strerror(errno));
+			return -1;
+		}
+	}
+	else
+	{
+		if (fsync(fd) != 0)
+		{
+			int save_errno;
+
+			/* close file upon error */
+			save_errno = errno;
+			close(fd);
+			errno = save_errno;
+
+			fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+					progname, newfile, strerror(errno));
+			return -1;
+		}
+	}
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname(newfile, false, progname) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile, progname) != 0)
+		return -1;
+
+	return 0;
 }
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index d3794df..1cb263d 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -15,8 +15,11 @@
 #ifndef FILE_UTILS_H
 #define FILE_UTILS_H
 
-extern void fsync_fname(const char *fname, bool isdir,
-						const char *progname);
+extern int fsync_fname(const char *fname, bool isdir,
+					   const char *progname);
 extern void fsync_pgdata(const char *pg_data, const char *progname);
+extern int durable_rename(const char *oldfile, const char *newfile,
+						  const char *progname);
+extern int fsync_parent_path(const char *fname, const char *progname);
 
 #endif   /* FILE_UTILS_H */
-- 
2.10.0

0003-Add-no-sync-option-to-pg_basebackup.patchtext/x-diff; charset=US-ASCII; name=0003-Add-no-sync-option-to-pg_basebackup.patchDownload
From 1f7f5a664d15065aee2a431d1e1a3a3657dbd4f3 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 13 Sep 2016 13:07:12 +0900
Subject: [PATCH 3/4] Add --no-sync option to pg_basebackup

This makes pg_basebackup, and is useful for testing.
---
 doc/src/sgml/ref/pg_basebackup.sgml    | 15 +++++++++++++++
 src/bin/pg_basebackup/pg_basebackup.c  | 28 +++++++++++++++++++---------
 src/bin/pg_basebackup/pg_receivexlog.c |  1 +
 src/bin/pg_basebackup/receivelog.c     | 20 +++++++++++---------
 src/bin/pg_basebackup/receivelog.h     |  4 +++-
 5 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 9f1eae1..3f1938a 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -439,6 +439,21 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-N</option></term>
+      <term><option>--no-sync</option></term>
+      <listitem>
+       <para>
+        By default, <command>pg_basebackup</command> will wait for all files
+        to be written safely to disk.  This option causes
+        <command>pg_basebackup</command> to return without waiting, which is
+        faster, but means that a subsequent operating system crash can leave
+        the base backup corrupt.  Generally, this option is useful for testing
+        but should not be used when creating a production installation.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-v</option></term>
       <term><option>--verbose</option></term>
       <listitem>
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 2266e34..e2f1629 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -67,6 +67,7 @@ static bool includewal = false;
 static bool streamwal = false;
 static bool fastcheckpoint = false;
 static bool writerecoveryconf = false;
+static bool do_sync = true;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static pg_time_t last_progress_report = 0;
 static int32 maxrate = 0;		/* no limit by default */
@@ -326,6 +327,7 @@ usage(void)
 	printf(_("  -c, --checkpoint=fast|spread\n"
 			 "                         set fast or spread checkpointing\n"));
 	printf(_("  -l, --label=LABEL      set backup label\n"));
+	printf(_("  -N, --no-sync           do not wait for changes to be written safely to disk\n"));
 	printf(_("  -n, --noclean          do not clean up after errors\n"));
 	printf(_("  -P, --progress         show progress information\n"));
 	printf(_("  -v, --verbose          output verbose messages\n"));
@@ -458,6 +460,7 @@ LogStreamerMain(logstreamer_param *param)
 	stream.stream_stop = reached_end_position;
 	stream.standby_message_timeout = standby_message_timeout;
 	stream.synchronous = false;
+	stream.do_sync = do_sync;
 	stream.mark_done = true;
 	stream.basedir = param->xlogdir;
 	stream.partial_suffix = NULL;
@@ -1197,7 +1200,7 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 		PQfreemem(copybuf);
 
 	/* sync the resulting tar file, errors are not considered fatal */
-	if (strcmp(basedir, "-") != 0)
+	if (do_sync && strcmp(basedir, "-") != 0)
 		(void) fsync_fname(filename, false, progname);
 }
 
@@ -1965,14 +1968,17 @@ BaseBackup(void)
 	 * all the data of the base directory is synced, taking into account
 	 * all the tablespaces. Errors are not considered fatal.
 	 */
-	if (format == 't')
+	if (do_sync)
 	{
-		if (strcmp(basedir, "-") != 0)
-			(void) fsync_fname(basedir, true, progname);
-	}
-	else
-	{
-		(void) fsync_pgdata(basedir, progname);
+		if (format == 't')
+		{
+			if (strcmp(basedir, "-") != 0)
+				(void) fsync_fname(basedir, true, progname);
+		}
+		else
+		{
+			(void) fsync_pgdata(basedir, progname);
+		}
 	}
 
 	if (verbose)
@@ -1999,6 +2005,7 @@ main(int argc, char **argv)
 		{"compress", required_argument, NULL, 'Z'},
 		{"label", required_argument, NULL, 'l'},
 		{"noclean", no_argument, NULL, 'n'},
+		{"no-sync", no_argument, NULL, 'N'},
 		{"dbname", required_argument, NULL, 'd'},
 		{"host", required_argument, NULL, 'h'},
 		{"port", required_argument, NULL, 'p'},
@@ -2035,7 +2042,7 @@ main(int argc, char **argv)
 
 	atexit(cleanup_directories_atexit);
 
-	while ((c = getopt_long(argc, argv, "D:F:r:RT:xX:l:nzZ:d:c:h:p:U:s:S:wWvP",
+	while ((c = getopt_long(argc, argv, "D:F:r:RT:xX:l:nNzZ:d:c:h:p:U:s:S:wWvP",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -2059,6 +2066,9 @@ main(int argc, char **argv)
 			case 'r':
 				maxrate = parse_max_rate(optarg);
 				break;
+			case 'N':
+				do_sync = false;
+				break;
 			case 'R':
 				writerecoveryconf = true;
 				break;
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index 7f7ee9d..a58a251 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -336,6 +336,7 @@ StreamLog(void)
 	stream.stream_stop = stop_streaming;
 	stream.standby_message_timeout = standby_message_timeout;
 	stream.synchronous = synchronous;
+	stream.do_sync = true;
 	stream.mark_done = false;
 	stream.basedir = basedir;
 	stream.partial_suffix = ".partial";
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 0a740ce..0c01bfc 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -53,7 +53,7 @@ static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
 
 static bool
-mark_file_as_archived(const char *basedir, const char *fname)
+mark_file_as_archived(const char *basedir, const char *fname, bool do_sync)
 {
 	int			fd;
 	static char tmppath[MAXPGPATH];
@@ -71,10 +71,10 @@ mark_file_as_archived(const char *basedir, const char *fname)
 
 	close(fd);
 
-	if (fsync_fname(tmppath, false, progname) != 0)
+	if (do_sync && fsync_fname(tmppath, false, progname) != 0)
 		return false;
 
-	if (fsync_parent_path(tmppath, progname) != 0)
+	if (do_sync && fsync_parent_path(tmppath, progname) != 0)
 		return false;
 
 	return true;
@@ -135,9 +135,9 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 		 * fsync, in case of a previous crash between padding and fsyncing the
 		 * file.
 		 */
-		if (fsync_fname(fn, false, progname) != 0)
+		if (stream->do_sync && fsync_fname(fn, false, progname) != 0)
 			return false;
-		if (fsync_parent_path(fn, progname) != 0)
+		if (stream->do_sync && fsync_parent_path(fn, progname) != 0)
 			return false;
 
 		return true;
@@ -174,9 +174,9 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	 * using synchronous mode, where the file is modified and fsynced
 	 * in-place, without a directory fsync.
 	 */
-	if (fsync_fname(fn, false, progname) != 0)
+	if (stream->do_sync && fsync_fname(fn, false, progname) != 0)
 		return false;
-	if (fsync_parent_path(fn, progname) != 0)
+	if (stream->do_sync && fsync_parent_path(fn, progname) != 0)
 		return false;
 
 	if (lseek(f, SEEK_SET, 0) != 0)
@@ -259,7 +259,8 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 	if (currpos == XLOG_SEG_SIZE && stream->mark_done)
 	{
 		/* writes error message if failed */
-		if (!mark_file_as_archived(stream->basedir, current_walfile_name))
+		if (!mark_file_as_archived(stream->basedir, current_walfile_name,
+								   stream->do_sync))
 			return false;
 	}
 
@@ -379,7 +380,8 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	if (stream->mark_done)
 	{
 		/* writes error message if failed */
-		if (!mark_file_as_archived(stream->basedir, histfname))
+		if (!mark_file_as_archived(stream->basedir, histfname,
+								   stream->do_sync))
 			return false;
 	}
 
diff --git a/src/bin/pg_basebackup/receivelog.h b/src/bin/pg_basebackup/receivelog.h
index 554ff8b..7a3bbc5 100644
--- a/src/bin/pg_basebackup/receivelog.h
+++ b/src/bin/pg_basebackup/receivelog.h
@@ -34,8 +34,10 @@ typedef struct StreamCtl
 								 * timeline */
 	int			standby_message_timeout;		/* Send status messages this
 												 * often */
-	bool		synchronous;	/* Flush data on write */
+	bool		synchronous;	/* Flush immediately WAL data on write */
 	bool		mark_done;		/* Mark segment as done in generated archive */
+	bool		do_sync;		/* Flush to disk to ensure consistent state
+								 * of data */
 
 	stream_stop_callback stream_stop;	/* Stop streaming when returns true */
 
-- 
2.10.0

0004-Switch-pg_basebackup-commands-in-Postgres.pm-to-use-.patchtext/x-diff; charset=US-ASCII; name=0004-Switch-pg_basebackup-commands-in-Postgres.pm-to-use-.patchDownload
From ec8fcb31c71b354fe6c29124bbe2757765911235 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 13 Sep 2016 13:10:45 +0900
Subject: [PATCH 4/4] Switch pg_basebackup commands in Postgres.pm to use
 --no-sync

On low-spec machines, this greatly reduces the I/O pressure induced by the
tests.
---
 src/test/perl/PostgresNode.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index fede1e6..7210f30 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -476,7 +476,7 @@ sub backup
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-p', $port,
-		'-x');
+		'-x', '--no-sync');
 	print "# Backup finished\n";
 }
 
-- 
2.10.0

#18Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#17)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On 9/12/16 11:16 PM, Michael Paquier wrote:

I don't think tar file output in pg_basebackup needs to be fsynced.

It's just a backup file like what pg_dump produces, and we don't fsync
that either. The point of this change is to leave a data directory in
a synced state equivalent to what initdb leaves behind.

Here I don't agree. The point of the patch is to make sure that what
gets generated by pg_basebackup is consistent on disk, for all the
formats. It seems weird to me that we could say that the plain format
makes things consistent while the tar format may cause some files to
be lost in case of power outage. The docs make it clear I think:
+        By default, <command>pg_basebackup</command> will wait for all files
+        to be written safely to disk.
But perhaps this should directly mention that this is done for all the formats?

That doesn't really explain why we fsync. If we think that all
"important" files should be fsynced, why aren't we doing it in pg_dump?
Or psql, or server-side copy? Similarly, there is no straightforward
mechanism by which you can unpack the tar file generated by
pg_basebackup and get the unpacked directory fsynced properly. (I
suppose initdb -S should be recommended.)

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Eisentraut (#18)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Thu, Sep 15, 2016 at 9:44 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 9/12/16 11:16 PM, Michael Paquier wrote:

I don't think tar file output in pg_basebackup needs to be fsynced.

It's just a backup file like what pg_dump produces, and we don't fsync
that either. The point of this change is to leave a data directory in
a synced state equivalent to what initdb leaves behind.

Here I don't agree. The point of the patch is to make sure that what
gets generated by pg_basebackup is consistent on disk, for all the
formats. It seems weird to me that we could say that the plain format
makes things consistent while the tar format may cause some files to
be lost in case of power outage. The docs make it clear I think:
+        By default, <command>pg_basebackup</command> will wait for all files
+        to be written safely to disk.
But perhaps this should directly mention that this is done for all the formats?

That doesn't really explain why we fsync.

Data durability, particularly on ext4 as it has been discussed a
couple of months back [1]Silent data loss on ext4: /messages/by-id/56583BDD.9060302@2ndquadrant.com. In case if a crash it could be perfectly
possible to lose files, hence we need to be sure that the files
themselves are flushed, as well as their parent directory to prevent
any problems. I think that we should protect users' backups as much as
we can, in the range we can.

If we think that all
"important" files should be fsynced, why aren't we doing it in pg_dump?

pg_dump should do it where it can, see thread [2]Data durability: /messages/by-id/20160327233033.GD20662@awork2.anarazel.de -- Michael. I am tackling
problems once at a time, and that's as well a reason why I would like
us to have a common set of routines in src/common or src/fe_utils to
help improve this handling.

Or psql, or server-side copy? Similarly, there is no straightforward
mechanism by which you can unpack the tar file generated by
pg_basebackup and get the unpacked directory fsynced properly. (I
suppose initdb -S should be recommended.)

Yes, those are cases that you cannot cover. Imagine for example
pg_basebackup's tar or pg_dump data sent to stdout. There is nothing
we can actually do for all those cases. However what we can do it
giving a set of options making possible to get consistent backups for
users.

[1]: Silent data loss on ext4: /messages/by-id/56583BDD.9060302@2ndquadrant.com
/messages/by-id/56583BDD.9060302@2ndquadrant.com

[2]: Data durability: /messages/by-id/20160327233033.GD20662@awork2.anarazel.de -- Michael
/messages/by-id/20160327233033.GD20662@awork2.anarazel.de
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#17)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

This is looking pretty good. More comments on this patch set:

0001:

Keep the file order alphabetical in Mkvcbuild.pm.

Needs nls.mk updates for file move (in initdb and pg_basebackup
directories).

0002:

durable_rename needs close(fd) in non-error path (compare backend code).

Changing from fsync() to fsync_name() in some cases means that
inaccessible files are now ignored. I guess this would only happen in
very obscure circumstances, but it's worth considering if that is OK.

You added this comment:

* XXX: This means that we might not restart if a crash occurs
before the
* fsync below. We probably should create the file in a temporary path
* like the backend does...

pg_receivexlog uses the .partial suffix for this reason. Why doesn't
pg_basebackup do that?

In open_walfile, could we move the fsync calls before the fstat or
somewhere around there so we don't have to make the same calls in two
different branches?

0003:

There was a discussion about renaming the --nosync option in initdb to
--no-sync, but until that is done, the option in pg_basebackup should
stay what initdb has right now.

There is a whitespace alignment error in usage().

The -N option should be listed after -n.

Some fsync calls are not covered by a do_sync conditional. I see some
in close_walfile(), HandleCopyStream(), ProcessKeepaliveMsg().

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Eisentraut (#20)
4 attachment(s)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Fri, Sep 23, 2016 at 10:54 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

This is looking pretty good. More comments on this patch set:

Thanks for the review.

0001:

Keep the file order alphabetical in Mkvcbuild.pm.

Needs nls.mk updates for file move (in initdb and pg_basebackup
directories).

Fixed.

0002:

durable_rename needs close(fd) in non-error path (compare backend code).

Oops, leak.

Changing from fsync() to fsync_name() in some cases means that
inaccessible files are now ignored. I guess this would only happen in
very obscure circumstances, but it's worth considering if that is OK.

In writeTimeLineHistoryFile() that's fine, the user should have
ownership of it as it has been created by receivelog.c. Similar remark
for .
In mark_file_as_archived, the previous open() call would have just failed.

You added this comment:
* XXX: This means that we might not restart if a crash occurs
before the
* fsync below. We probably should create the file in a temporary path
* like the backend does...
pg_receivexlog uses the .partial suffix for this reason. Why doesn't
pg_basebackup do that?

Because it matters for pg_receivexlog as it has a retry logic and it
is able to rework on a partial segment. Thinking more about that this
comment does not make much sense, because for pg_basebackup a crash
would just cause the backup to be incomplete, so I have removed it.

In open_walfile, could we move the fsync calls before the fstat or
somewhere around there so we don't have to make the same calls in two
different branches?

We could refactor a bit the code so as follows:
if (size == 16MB)
{
walfile = f;
}
else if (size == 0)
{
//do padding and lseek
}
else
error();
fsync();
I am not sure if this gains in clarity though, perhaps I looked too
much the current code.

0003:

There was a discussion about renaming the --nosync option in initdb to
--no-sync, but until that is done, the option in pg_basebackup should
stay what initdb has right now.

Changed.

There is a whitespace alignment error in usage().

Not sure I follow here.

The -N option should be listed after -n.

In the switch()? Fixed.

Some fsync calls are not covered by a do_sync conditional. I see some
in close_walfile(), HandleCopyStream(), ProcessKeepaliveMsg().

Right. I looked at the rest and did not notice any extra mistakes.
--
Michael

Attachments:

0001-Relocation-fsync-routines-of-initdb-into-src-common.patchapplication/x-patch; name=0001-Relocation-fsync-routines-of-initdb-into-src-common.patchDownload
From 79a9c302bc4723d6709bf5f1dc6ca673598e8b4a Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 23 Sep 2016 23:23:05 +0900
Subject: [PATCH 1/4] Relocation fsync routines of initdb into src/common

Those are aimed at being used by other utilities, pg_basebackup mainly,
and at some other degree by pg_dump and pg_receivexlog.
---
 src/bin/initdb/initdb.c         | 270 ++-------------------------------------
 src/bin/initdb/nls.mk           |   2 +-
 src/bin/pg_basebackup/nls.mk    |   2 +-
 src/common/Makefile             |   2 +-
 src/common/file_utils.c         | 276 ++++++++++++++++++++++++++++++++++++++++
 src/include/common/file_utils.h |  22 ++++
 src/tools/msvc/Mkvcbuild.pm     |   2 +-
 7 files changed, 313 insertions(+), 263 deletions(-)
 create mode 100644 src/common/file_utils.c
 create mode 100644 src/include/common/file_utils.h

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3350e13..e52e67d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -61,6 +61,7 @@
 #endif
 
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "common/restricted_token.h"
 #include "common/username.h"
 #include "mb/pg_wchar.h"
@@ -70,13 +71,6 @@
 #include "fe_utils/string_utils.h"
 
 
-/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
-#if defined(HAVE_SYNC_FILE_RANGE)
-#define PG_FLUSH_DATA_WORKS 1
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-#define PG_FLUSH_DATA_WORKS 1
-#endif
-
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
 
@@ -237,13 +231,6 @@ static char **filter_lines_with_token(char **lines, const char *token);
 #endif
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
-static void walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks);
-#ifdef PG_FLUSH_DATA_WORKS
-static void pre_sync_fname(const char *fname, bool isdir);
-#endif
-static void fsync_fname_ext(const char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
@@ -270,7 +257,6 @@ static void load_plpgsql(FILE *cmdfd);
 static void vacuum_db(FILE *cmdfd);
 static void make_template0(FILE *cmdfd);
 static void make_postgres(FILE *cmdfd);
-static void fsync_pgdata(void);
 static void trapsig(int signum);
 static void check_ok(void);
 static char *escape_quotes(const char *src);
@@ -529,177 +515,6 @@ writefile(char *path, char **lines)
 }
 
 /*
- * walkdir: recursively walk a directory, applying the action to each
- * regular file and directory (including the named directory itself).
- *
- * If process_symlinks is true, the action and recursion are also applied
- * to regular files and directories that are pointed to by symlinks in the
- * given directory; otherwise symlinks are ignored.  Symlinks are always
- * ignored in subdirectories, ie we intentionally don't pass down the
- * process_symlinks flag to recursive calls.
- *
- * Errors are reported but not considered fatal.
- *
- * See also walkdir in fd.c, which is a backend version of this logic.
- */
-static void
-walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
-{
-	DIR		   *dir;
-	struct dirent *de;
-
-	dir = opendir(path);
-	if (dir == NULL)
-	{
-		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
-				progname, path, strerror(errno));
-		return;
-	}
-
-	while (errno = 0, (de = readdir(dir)) != NULL)
-	{
-		char		subpath[MAXPGPATH];
-		struct stat fst;
-		int			sret;
-
-		if (strcmp(de->d_name, ".") == 0 ||
-			strcmp(de->d_name, "..") == 0)
-			continue;
-
-		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
-
-		if (process_symlinks)
-			sret = stat(subpath, &fst);
-		else
-			sret = lstat(subpath, &fst);
-
-		if (sret < 0)
-		{
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-					progname, subpath, strerror(errno));
-			continue;
-		}
-
-		if (S_ISREG(fst.st_mode))
-			(*action) (subpath, false);
-		else if (S_ISDIR(fst.st_mode))
-			walkdir(subpath, action, false);
-	}
-
-	if (errno)
-		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
-				progname, path, strerror(errno));
-
-	(void) closedir(dir);
-
-	/*
-	 * It's important to fsync the destination directory itself as individual
-	 * file fsyncs don't guarantee that the directory entry for the file is
-	 * synced.  Recent versions of ext4 have made the window much wider but
-	 * it's been an issue for ext3 and other filesystems in the past.
-	 */
-	(*action) (path, true);
-}
-
-/*
- * Hint to the OS that it should get ready to fsync() this file.
- *
- * Ignores errors trying to open unreadable files, and reports other errors
- * non-fatally.
- */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static void
-pre_sync_fname(const char *fname, bool isdir)
-{
-	int			fd;
-
-	fd = open(fname, O_RDONLY | PG_BINARY);
-
-	if (fd < 0)
-	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
-		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-		return;
-	}
-
-	/*
-	 * We do what pg_flush_data() would do in the backend: prefer to use
-	 * sync_file_range, but fall back to posix_fadvise.  We ignore errors
-	 * because this is only a hint.
-	 */
-#if defined(HAVE_SYNC_FILE_RANGE)
-	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
-#else
-#error PG_FLUSH_DATA_WORKS should not have been defined
-#endif
-
-	(void) close(fd);
-}
-
-#endif   /* PG_FLUSH_DATA_WORKS */
-
-/*
- * fsync_fname_ext -- Try to fsync a file or directory
- *
- * Ignores errors trying to open unreadable files, or trying to fsync
- * directories on systems where that isn't allowed/required.  Reports
- * other errors non-fatally.
- */
-static void
-fsync_fname_ext(const char *fname, bool isdir)
-{
-	int			fd;
-	int			flags;
-	int			returncode;
-
-	/*
-	 * Some OSs require directories to be opened read-only whereas other
-	 * systems don't allow us to fsync files opened read-only; so we need both
-	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
-	 * not writable by our userid, but we assume that's OK.
-	 */
-	flags = PG_BINARY;
-	if (!isdir)
-		flags |= O_RDWR;
-	else
-		flags |= O_RDONLY;
-
-	/*
-	 * Open the file, silently ignoring errors about unreadable files (or
-	 * unsupported operations, e.g. opening a directory under Windows), and
-	 * logging others.
-	 */
-	fd = open(fname, flags);
-	if (fd < 0)
-	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
-		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-		return;
-	}
-
-	returncode = fsync(fd);
-
-	/*
-	 * Some OSes don't allow us to fsync directories at all, so we can ignore
-	 * those errors. Anything else needs to be reported.
-	 */
-	if (returncode != 0 && !(isdir && errno == EBADF))
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, fname, strerror(errno));
-
-	(void) close(fd);
-}
-
-/*
  * Open a subcommand with suitable error messaging
  */
 static FILE *
@@ -2276,77 +2091,6 @@ make_postgres(FILE *cmdfd)
 		PG_CMD_PUTS(*line);
 }
 
-/*
- * Issue fsync recursively on PGDATA and all its contents.
- *
- * We fsync regular files and directories wherever they are, but we
- * follow symlinks only for pg_xlog and immediately under pg_tblspc.
- * Other symlinks are presumed to point at files we're not responsible
- * for fsyncing, and might not have privileges to write at all.
- *
- * Errors are reported but not considered fatal.
- */
-static void
-fsync_pgdata(void)
-{
-	bool		xlog_is_symlink;
-	char		pg_xlog[MAXPGPATH];
-	char		pg_tblspc[MAXPGPATH];
-
-	fputs(_("syncing data to disk ... "), stdout);
-	fflush(stdout);
-
-	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
-	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
-
-	/*
-	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
-	 * because the first walkdir below will ignore it.
-	 */
-	xlog_is_symlink = false;
-
-#ifndef WIN32
-	{
-		struct stat st;
-
-		if (lstat(pg_xlog, &st) < 0)
-			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
-					progname, pg_xlog, strerror(errno));
-		else if (S_ISLNK(st.st_mode))
-			xlog_is_symlink = true;
-	}
-#else
-	if (pgwin32_is_junction(pg_xlog))
-		xlog_is_symlink = true;
-#endif
-
-	/*
-	 * If possible, hint to the kernel that we're soon going to fsync the data
-	 * directory and its contents.
-	 */
-#ifdef PG_FLUSH_DATA_WORKS
-	walkdir(pg_data, pre_sync_fname, false);
-	if (xlog_is_symlink)
-		walkdir(pg_xlog, pre_sync_fname, false);
-	walkdir(pg_tblspc, pre_sync_fname, true);
-#endif
-
-	/*
-	 * Now we do the fsync()s in the same order.
-	 *
-	 * The main call ignores symlinks, so in addition to specially processing
-	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
-	 * process_symlinks = true.  Note that if there are any plain directories
-	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
-	 * so we don't worry about optimizing it.
-	 */
-	walkdir(pg_data, fsync_fname_ext, false);
-	if (xlog_is_symlink)
-		walkdir(pg_xlog, fsync_fname_ext, false);
-	walkdir(pg_tblspc, fsync_fname_ext, true);
-
-	check_ok();
-}
 
 
 /*
@@ -3512,7 +3256,10 @@ main(int argc, char *argv[])
 			exit_nicely();
 		}
 
-		fsync_pgdata();
+		fputs(_("syncing data to disk ... "), stdout);
+		fflush(stdout);
+		fsync_pgdata(pg_data, progname);
+		check_ok();
 		return 0;
 	}
 
@@ -3574,7 +3321,12 @@ main(int argc, char *argv[])
 	initialize_data_directory();
 
 	if (do_sync)
-		fsync_pgdata();
+	{
+		fputs(_("syncing data to disk ... "), stdout);
+		fflush(stdout);
+		fsync_pgdata(pg_data, progname);
+		check_ok();
+	}
 	else
 		printf(_("\nSync to disk skipped.\nThe data directory might become corrupt if the operating system crashes.\n"));
 
diff --git a/src/bin/initdb/nls.mk b/src/bin/initdb/nls.mk
index 0d53683..7a12daa 100644
--- a/src/bin/initdb/nls.mk
+++ b/src/bin/initdb/nls.mk
@@ -1,5 +1,5 @@
 # src/bin/initdb/nls.mk
 CATALOG_NAME     = initdb
 AVAIL_LANGUAGES  = cs de es fr it ja ko pl pt_BR ru sv zh_CN
-GETTEXT_FILES    = findtimezone.c initdb.c ../../common/exec.c ../../common/fe_memutils.c ../../common/pgfnames.c ../../common/restricted_token.c ../../common/rmtree.c ../../common/username.c ../../common/wait_error.c ../../port/dirmod.c
+GETTEXT_FILES    = findtimezone.c initdb.c ../../common/exec.c ../../common/fe_memutils.c ../../common/file_utils.c ../../common/pgfnames.c ../../common/restricted_token.c ../../common/rmtree.c ../../common/username.c ../../common/wait_error.c ../../port/dirmod.c
 GETTEXT_TRIGGERS = simple_prompt
diff --git a/src/bin/pg_basebackup/nls.mk b/src/bin/pg_basebackup/nls.mk
index a34ca3d..dba43b8 100644
--- a/src/bin/pg_basebackup/nls.mk
+++ b/src/bin/pg_basebackup/nls.mk
@@ -1,5 +1,5 @@
 # src/bin/pg_basebackup/nls.mk
 CATALOG_NAME     = pg_basebackup
 AVAIL_LANGUAGES  = de es fr it ko pl pt_BR ru zh_CN
-GETTEXT_FILES    = pg_basebackup.c pg_receivexlog.c pg_recvlogical.c receivelog.c streamutil.c ../../common/fe_memutils.c
+GETTEXT_FILES    = pg_basebackup.c pg_receivexlog.c pg_recvlogical.c receivelog.c streamutil.c ../../common/fe_memutils.c ../../common/file_utils.c
 GETTEXT_TRIGGERS = simple_prompt
diff --git a/src/common/Makefile b/src/common/Makefile
index a5fa649..03dfaa1 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -44,7 +44,7 @@ OBJS_COMMON = config_info.o controldata_utils.o exec.o ip.o keywords.o \
 	md5.o pg_lzcompress.o pgfnames.o psprintf.o relpath.o rmtree.o \
 	string.o username.o wait_error.o
 
-OBJS_FRONTEND = $(OBJS_COMMON) fe_memutils.o restricted_token.o
+OBJS_FRONTEND = $(OBJS_COMMON) fe_memutils.o file_utils.o restricted_token.o
 
 OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o)
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
new file mode 100644
index 0000000..b6f62f7
--- /dev/null
+++ b/src/common/file_utils.c
@@ -0,0 +1,276 @@
+/*-------------------------------------------------------------------------
+ *
+ * File-processing utility routines.
+ *
+ * Assorted utility functions to work on files.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/common/file_utils.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_utils.h"
+
+
+/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
+#if defined(HAVE_SYNC_FILE_RANGE)
+#define PG_FLUSH_DATA_WORKS 1
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+#define PG_FLUSH_DATA_WORKS 1
+#endif
+
+#ifdef PG_FLUSH_DATA_WORKS
+static void pre_sync_fname(const char *fname, bool isdir,
+						   const char *progname);
+#endif
+static void walkdir(const char *path,
+	void (*action) (const char *fname, bool isdir, const char *progname),
+	bool process_symlinks, const char *progname);
+
+/*
+ * Issue fsync recursively on PGDATA and all its contents.
+ *
+ * We fsync regular files and directories wherever they are, but we
+ * follow symlinks only for pg_xlog and immediately under pg_tblspc.
+ * Other symlinks are presumed to point at files we're not responsible
+ * for fsyncing, and might not have privileges to write at all.
+ *
+ * Errors are reported but not considered fatal.
+ */
+void
+fsync_pgdata(const char *pg_data, const char *progname)
+{
+	bool		xlog_is_symlink;
+	char		pg_xlog[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
+
+	snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
+	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
+
+	/*
+	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
+	 * because the first walkdir below will ignore it.
+	 */
+	xlog_is_symlink = false;
+
+#ifndef WIN32
+	{
+		struct stat st;
+
+		if (lstat(pg_xlog, &st) < 0)
+			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
+					progname, pg_xlog, strerror(errno));
+		else if (S_ISLNK(st.st_mode))
+			xlog_is_symlink = true;
+	}
+#else
+	if (pgwin32_is_junction(pg_xlog))
+		xlog_is_symlink = true;
+#endif
+
+	/*
+	 * If possible, hint to the kernel that we're soon going to fsync the data
+	 * directory and its contents.
+	 */
+#ifdef PG_FLUSH_DATA_WORKS
+	walkdir(pg_data, pre_sync_fname, false, progname);
+	if (xlog_is_symlink)
+		walkdir(pg_xlog, pre_sync_fname, false, progname);
+	walkdir(pg_tblspc, pre_sync_fname, true, progname);
+#endif
+
+	/*
+	 * Now we do the fsync()s in the same order.
+	 *
+	 * The main call ignores symlinks, so in addition to specially processing
+	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
+	 * process_symlinks = true.  Note that if there are any plain directories
+	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
+	 * so we don't worry about optimizing it.
+	 */
+	walkdir(pg_data, fsync_fname, false, progname);
+	if (xlog_is_symlink)
+		walkdir(pg_xlog, fsync_fname, false, progname);
+	walkdir(pg_tblspc, fsync_fname, true, progname);
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself).
+ *
+ * If process_symlinks is true, the action and recursion are also applied
+ * to regular files and directories that are pointed to by symlinks in the
+ * given directory; otherwise symlinks are ignored.  Symlinks are always
+ * ignored in subdirectories, ie we intentionally don't pass down the
+ * process_symlinks flag to recursive calls.
+ *
+ * Errors are reported but not considered fatal.
+ *
+ * See also walkdir in fd.c, which is a backend version of this logic.
+ */
+static void
+walkdir(const char *path,
+		void (*action) (const char *fname, bool isdir, const char *progname),
+		bool process_symlinks, const char *progname)
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = opendir(path);
+	if (dir == NULL)
+	{
+		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
+				progname, path, strerror(errno));
+		return;
+	}
+
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+		int			sret;
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (process_symlinks)
+			sret = stat(subpath, &fst);
+		else
+			sret = lstat(subpath, &fst);
+
+		if (sret < 0)
+		{
+			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
+					progname, subpath, strerror(errno));
+			continue;
+		}
+
+		if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false, progname);
+		else if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action, false, progname);
+	}
+
+	if (errno)
+		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
+				progname, path, strerror(errno));
+
+	(void) closedir(dir);
+
+	/*
+	 * It's important to fsync the destination directory itself as individual
+	 * file fsyncs don't guarantee that the directory entry for the file is
+	 * synced.  Recent versions of ext4 have made the window much wider but
+	 * it's been an issue for ext3 and other filesystems in the past.
+	 */
+	(*action) (path, true, progname);
+}
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Ignores errors trying to open unreadable files, and reports other errors
+ * non-fatally.
+ */
+#ifdef PG_FLUSH_DATA_WORKS
+
+static void
+pre_sync_fname(const char *fname, bool isdir, const char *progname)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return;
+	}
+
+	/*
+	 * We do what pg_flush_data() would do in the backend: prefer to use
+	 * sync_file_range, but fall back to posix_fadvise.  We ignore errors
+	 * because this is only a hint.
+	 */
+#if defined(HAVE_SYNC_FILE_RANGE)
+	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+#else
+#error PG_FLUSH_DATA_WORKS should not have been defined
+#endif
+
+	(void) close(fd);
+}
+
+#endif   /* PG_FLUSH_DATA_WORKS */
+
+/*
+ * fsync_fname -- Try to fsync a file or directory
+ *
+ * Ignores errors trying to open unreadable files, or trying to fsync
+ * directories on systems where that isn't allowed/required.  Reports
+ * other errors non-fatally.
+ */
+void
+fsync_fname(const char *fname, bool isdir, const char *progname)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (errno == EACCES || (isdir && errno == EISDIR))
+			return;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+
+	(void) close(fd);
+}
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
new file mode 100644
index 0000000..d3794df
--- /dev/null
+++ b/src/include/common/file_utils.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * File-processing utility routines for frontend code
+ *
+ * Assorted utility functions to work on files.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/common/file_utils.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FILE_UTILS_H
+#define FILE_UTILS_H
+
+extern void fsync_fname(const char *fname, bool isdir,
+						const char *progname);
+extern void fsync_pgdata(const char *pg_data, const char *progname);
+
+#endif   /* FILE_UTILS_H */
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 93dfd24..952d226 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -115,7 +115,7 @@ sub mkvcbuild
 	  string.c username.c wait_error.c);
 
 	our @pgcommonfrontendfiles = (
-		@pgcommonallfiles, qw(fe_memutils.c
+		@pgcommonallfiles, qw(fe_memutils.c file_utils.c
 		  restricted_token.c));
 
 	our @pgcommonbkndfiles = @pgcommonallfiles;
-- 
2.10.0

0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patchapplication/x-patch; name=0002-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patchDownload
From a010b9c6b39eb7416172a5b2c709db64fd9d9ce1 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 23 Sep 2016 23:47:07 +0900
Subject: [PATCH 2/4] Issue fsync more carefully in pg_receivexlog and
 pg_basebackup [-X] stream.

Several places weren't careful about fsyncing in the way. See 1d4a0ab1
and 606e0f98 for details about required fsyncs.

This adds a couple of routines in fe_utils which have an equivalent in the
backend:
- durable_rename
- fsync_parent_path
---
 src/bin/pg_basebackup/pg_basebackup.c |  27 +++++++++
 src/bin/pg_basebackup/receivelog.c    |  52 ++++++++++-------
 src/common/file_utils.c               | 105 ++++++++++++++++++++++++++++++++--
 src/include/common/file_utils.h       |   7 ++-
 4 files changed, 162 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 42f3b27..2266e34 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -25,6 +25,7 @@
 #include <zlib.h>
 #endif
 
+#include "common/file_utils.h"
 #include "common/string.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
@@ -1194,6 +1195,10 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (copybuf != NULL)
 		PQfreemem(copybuf);
+
+	/* sync the resulting tar file, errors are not considered fatal */
+	if (strcmp(basedir, "-") != 0)
+		(void) fsync_fname(filename, false, progname);
 }
 
 
@@ -1470,6 +1475,11 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (basetablespace && writerecoveryconf)
 		WriteRecoveryConf();
+
+	/*
+	 * No data is synced here, everything is done for all tablespaces at the
+	 * end.
+	 */
 }
 
 /*
@@ -1948,6 +1958,23 @@ BaseBackup(void)
 	PQclear(res);
 	PQfinish(conn);
 
+	/*
+	 * Make data persistent on disk once backup is completed. For tar
+	 * format once syncing the parent directory is fine, each tar file
+	 * created per tablespace has been already synced. In plain format,
+	 * all the data of the base directory is synced, taking into account
+	 * all the tablespaces. Errors are not considered fatal.
+	 */
+	if (format == 't')
+	{
+		if (strcmp(basedir, "-") != 0)
+			(void) fsync_fname(basedir, true, progname);
+	}
+	else
+	{
+		(void) fsync_pgdata(basedir, progname);
+	}
+
 	if (verbose)
 		fprintf(stderr, "%s: base backup completed\n", progname);
 }
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 062730b..546df9a 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -23,6 +23,7 @@
 
 #include "libpq-fe.h"
 #include "access/xlog_internal.h"
+#include "common/file_utils.h"
 
 
 /* fd and filename for currently open WAL file */
@@ -68,17 +69,13 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
+	close(fd);
 
+	if (fsync_fname(tmppath, false, progname) != 0)
 		return false;
-	}
 
-	close(fd);
+	if (fsync_parent_path(tmppath, progname) != 0)
+		return false;
 
 	return true;
 }
@@ -129,6 +126,16 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	{
 		/* File is open and ready to use */
 		walfile = f;
+
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname(fn, false, progname) != 0)
+			return false;
+		if (fsync_parent_path(fn, progname) != 0)
+			return false;
+
 		return true;
 	}
 	if (statbuf.st_size != 0)
@@ -157,6 +164,17 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed. That's particularly important when
+	 * using synchronous mode, where the file is modified and fsynced
+	 * in-place, without a directory fsync.
+	 */
+	if (fsync_fname(fn, false, progname) != 0)
+		return false;
+	if (fsync_parent_path(fn, progname) != 0)
+		return false;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -217,10 +235,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn, progname) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -338,14 +355,6 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		close(fd);
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-		return false;
-	}
-
 	if (close(fd) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
@@ -356,10 +365,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path, progname) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index b6f62f7..04cd365 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -34,7 +34,7 @@ static void pre_sync_fname(const char *fname, bool isdir,
 						   const char *progname);
 #endif
 static void walkdir(const char *path,
-	void (*action) (const char *fname, bool isdir, const char *progname),
+	int (*action) (const char *fname, bool isdir, const char *progname),
 	bool process_symlinks, const char *progname);
 
 /*
@@ -120,7 +120,7 @@ fsync_pgdata(const char *pg_data, const char *progname)
  */
 static void
 walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir, const char *progname),
+		int (*action) (const char *fname, bool isdir, const char *progname),
 		bool process_symlinks, const char *progname)
 {
 	DIR		   *dir;
@@ -228,7 +228,7 @@ pre_sync_fname(const char *fname, bool isdir, const char *progname)
  * directories on systems where that isn't allowed/required.  Reports
  * other errors non-fatally.
  */
-void
+int
 fsync_fname(const char *fname, bool isdir, const char *progname)
 {
 	int			fd;
@@ -256,10 +256,10 @@ fsync_fname(const char *fname, bool isdir, const char *progname)
 	if (fd < 0)
 	{
 		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
+			return 0;
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 				progname, fname, strerror(errno));
-		return;
+		return -1;
 	}
 
 	returncode = fsync(fd);
@@ -269,8 +269,103 @@ fsync_fname(const char *fname, bool isdir, const char *progname)
 	 * those errors. Anything else needs to be reported.
 	 */
 	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
 		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
 				progname, fname, strerror(errno));
+		return -1;
+	}
 
 	(void) close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+int
+fsync_parent_path(const char *fname, const char *progname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname(parentpath, true, progname) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.
+ */
+int
+durable_rename(const char *oldfile, const char *newfile, const char *progname)
+{
+	int		fd;
+
+	/*
+	 * First fsync the old and target path (if it exists), to ensure that they
+	 * are properly persistent on disk. Syncing the target file is not
+	 * strictly necessary, but it makes it easier to reason about crashes;
+	 * because it's then guaranteed that either source or target file exists
+	 * after a crash.
+	 */
+	if (fsync_fname(oldfile, false, progname) != 0)
+		return -1;
+
+	fd = open(newfile, PG_BINARY | O_RDWR, 0);
+	if (fd < 0)
+	{
+		if (errno != ENOENT)
+		{
+			fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+					progname, newfile, strerror(errno));
+			return -1;
+		}
+	}
+	else
+	{
+		if (fsync(fd) != 0)
+		{
+			fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+					progname, newfile, strerror(errno));
+			close(fd);
+			return -1;
+		}
+		close(fd);
+	}
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname(newfile, false, progname) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile, progname) != 0)
+		return -1;
+
+	return 0;
 }
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index d3794df..1cb263d 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -15,8 +15,11 @@
 #ifndef FILE_UTILS_H
 #define FILE_UTILS_H
 
-extern void fsync_fname(const char *fname, bool isdir,
-						const char *progname);
+extern int fsync_fname(const char *fname, bool isdir,
+					   const char *progname);
 extern void fsync_pgdata(const char *pg_data, const char *progname);
+extern int durable_rename(const char *oldfile, const char *newfile,
+						  const char *progname);
+extern int fsync_parent_path(const char *fname, const char *progname);
 
 #endif   /* FILE_UTILS_H */
-- 
2.10.0

0003-Add-nosync-option-to-pg_basebackup.patchapplication/x-patch; name=0003-Add-nosync-option-to-pg_basebackup.patchDownload
From f58b37c61f3181c003c4fec7669d723ec9ebebc5 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Sat, 24 Sep 2016 00:12:13 +0900
Subject: [PATCH 3/4] Add --nosync option to pg_basebackup

This is useful for testing purposes, similarly to initdb's --nosync.
---
 doc/src/sgml/ref/pg_basebackup.sgml    | 15 +++++++++++++++
 src/bin/pg_basebackup/pg_basebackup.c  | 28 +++++++++++++++++++---------
 src/bin/pg_basebackup/pg_receivexlog.c |  1 +
 src/bin/pg_basebackup/receivelog.c     | 34 ++++++++++++++++++----------------
 src/bin/pg_basebackup/receivelog.h     |  4 +++-
 5 files changed, 56 insertions(+), 26 deletions(-)

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 9f1eae1..f73a026 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -439,6 +439,21 @@ PostgreSQL documentation
      </varlistentry>
 
      <varlistentry>
+      <term><option>-N</option></term>
+      <term><option>--nosync</option></term>
+      <listitem>
+       <para>
+        By default, <command>pg_basebackup</command> will wait for all files
+        to be written safely to disk.  This option causes
+        <command>pg_basebackup</command> to return without waiting, which is
+        faster, but means that a subsequent operating system crash can leave
+        the base backup corrupt.  Generally, this option is useful for testing
+        but should not be used when creating a production installation.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-v</option></term>
       <term><option>--verbose</option></term>
       <listitem>
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 2266e34..9ded7e0 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -67,6 +67,7 @@ static bool includewal = false;
 static bool streamwal = false;
 static bool fastcheckpoint = false;
 static bool writerecoveryconf = false;
+static bool do_sync = true;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static pg_time_t last_progress_report = 0;
 static int32 maxrate = 0;		/* no limit by default */
@@ -327,6 +328,7 @@ usage(void)
 			 "                         set fast or spread checkpointing\n"));
 	printf(_("  -l, --label=LABEL      set backup label\n"));
 	printf(_("  -n, --noclean          do not clean up after errors\n"));
+	printf(_("  -N, --nosync           do not wait for changes to be written safely to disk\n"));
 	printf(_("  -P, --progress         show progress information\n"));
 	printf(_("  -v, --verbose          output verbose messages\n"));
 	printf(_("  -V, --version          output version information, then exit\n"));
@@ -458,6 +460,7 @@ LogStreamerMain(logstreamer_param *param)
 	stream.stream_stop = reached_end_position;
 	stream.standby_message_timeout = standby_message_timeout;
 	stream.synchronous = false;
+	stream.do_sync = do_sync;
 	stream.mark_done = true;
 	stream.basedir = param->xlogdir;
 	stream.partial_suffix = NULL;
@@ -1197,7 +1200,7 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 		PQfreemem(copybuf);
 
 	/* sync the resulting tar file, errors are not considered fatal */
-	if (strcmp(basedir, "-") != 0)
+	if (do_sync && strcmp(basedir, "-") != 0)
 		(void) fsync_fname(filename, false, progname);
 }
 
@@ -1965,14 +1968,17 @@ BaseBackup(void)
 	 * all the data of the base directory is synced, taking into account
 	 * all the tablespaces. Errors are not considered fatal.
 	 */
-	if (format == 't')
+	if (do_sync)
 	{
-		if (strcmp(basedir, "-") != 0)
-			(void) fsync_fname(basedir, true, progname);
-	}
-	else
-	{
-		(void) fsync_pgdata(basedir, progname);
+		if (format == 't')
+		{
+			if (strcmp(basedir, "-") != 0)
+				(void) fsync_fname(basedir, true, progname);
+		}
+		else
+		{
+			(void) fsync_pgdata(basedir, progname);
+		}
 	}
 
 	if (verbose)
@@ -1999,6 +2005,7 @@ main(int argc, char **argv)
 		{"compress", required_argument, NULL, 'Z'},
 		{"label", required_argument, NULL, 'l'},
 		{"noclean", no_argument, NULL, 'n'},
+		{"nosync", no_argument, NULL, 'N'},
 		{"dbname", required_argument, NULL, 'd'},
 		{"host", required_argument, NULL, 'h'},
 		{"port", required_argument, NULL, 'p'},
@@ -2035,7 +2042,7 @@ main(int argc, char **argv)
 
 	atexit(cleanup_directories_atexit);
 
-	while ((c = getopt_long(argc, argv, "D:F:r:RT:xX:l:nzZ:d:c:h:p:U:s:S:wWvP",
+	while ((c = getopt_long(argc, argv, "D:F:r:RT:xX:l:nNzZ:d:c:h:p:U:s:S:wWvP",
 							long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -2113,6 +2120,9 @@ main(int argc, char **argv)
 			case 'n':
 				noclean = true;
 				break;
+			case 'N':
+				do_sync = false;
+				break;
 			case 'z':
 #ifdef HAVE_LIBZ
 				compresslevel = Z_DEFAULT_COMPRESSION;
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index 7f7ee9d..a58a251 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -336,6 +336,7 @@ StreamLog(void)
 	stream.stream_stop = stop_streaming;
 	stream.standby_message_timeout = standby_message_timeout;
 	stream.synchronous = synchronous;
+	stream.do_sync = true;
 	stream.mark_done = false;
 	stream.basedir = basedir;
 	stream.partial_suffix = ".partial";
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 546df9a..65ba117 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -38,8 +38,8 @@ static PGresult *HandleCopyStream(PGconn *conn, StreamCtl *stream,
 				 XLogRecPtr *stoppos);
 static int	CopyStreamPoll(PGconn *conn, long timeout_ms);
 static int	CopyStreamReceive(PGconn *conn, long timeout, char **buffer);
-static bool ProcessKeepaliveMsg(PGconn *conn, char *copybuf, int len,
-					XLogRecPtr blockpos, int64 *last_status);
+static bool ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf,
+					int len, XLogRecPtr blockpos, int64 *last_status);
 static bool ProcessXLogDataMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 				   XLogRecPtr *blockpos);
 static PGresult *HandleEndOfCopyStream(PGconn *conn, StreamCtl *stream, char *copybuf,
@@ -53,7 +53,7 @@ static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
 
 static bool
-mark_file_as_archived(const char *basedir, const char *fname)
+mark_file_as_archived(const char *basedir, const char *fname, bool do_sync)
 {
 	int			fd;
 	static char tmppath[MAXPGPATH];
@@ -71,10 +71,10 @@ mark_file_as_archived(const char *basedir, const char *fname)
 
 	close(fd);
 
-	if (fsync_fname(tmppath, false, progname) != 0)
+	if (do_sync && fsync_fname(tmppath, false, progname) != 0)
 		return false;
 
-	if (fsync_parent_path(tmppath, progname) != 0)
+	if (do_sync && fsync_parent_path(tmppath, progname) != 0)
 		return false;
 
 	return true;
@@ -131,9 +131,9 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 		 * fsync, in case of a previous crash between padding and fsyncing the
 		 * file.
 		 */
-		if (fsync_fname(fn, false, progname) != 0)
+		if (stream->do_sync && fsync_fname(fn, false, progname) != 0)
 			return false;
-		if (fsync_parent_path(fn, progname) != 0)
+		if (stream->do_sync && fsync_parent_path(fn, progname) != 0)
 			return false;
 
 		return true;
@@ -170,9 +170,9 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	 * using synchronous mode, where the file is modified and fsynced
 	 * in-place, without a directory fsync.
 	 */
-	if (fsync_fname(fn, false, progname) != 0)
+	if (stream->do_sync && fsync_fname(fn, false, progname) != 0)
 		return false;
-	if (fsync_parent_path(fn, progname) != 0)
+	if (stream->do_sync && fsync_parent_path(fn, progname) != 0)
 		return false;
 
 	if (lseek(f, SEEK_SET, 0) != 0)
@@ -209,7 +209,7 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 		return false;
 	}
 
-	if (fsync(walfile) != 0)
+	if (stream->do_sync && fsync(walfile) != 0)
 	{
 		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
 				progname, current_walfile_name, strerror(errno));
@@ -255,7 +255,8 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 	if (currpos == XLOG_SEG_SIZE && stream->mark_done)
 	{
 		/* writes error message if failed */
-		if (!mark_file_as_archived(stream->basedir, current_walfile_name))
+		if (!mark_file_as_archived(stream->basedir, current_walfile_name,
+								   stream->do_sync))
 			return false;
 	}
 
@@ -375,7 +376,8 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	if (stream->mark_done)
 	{
 		/* writes error message if failed */
-		if (!mark_file_as_archived(stream->basedir, histfname))
+		if (!mark_file_as_archived(stream->basedir, histfname,
+								   stream->do_sync))
 			return false;
 	}
 
@@ -833,7 +835,7 @@ HandleCopyStream(PGconn *conn, StreamCtl *stream,
 		 */
 		if (stream->synchronous && lastFlushPosition < blockpos && walfile != -1)
 		{
-			if (fsync(walfile) != 0)
+			if (stream->do_sync && fsync(walfile) != 0)
 			{
 				fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
 						progname, current_walfile_name, strerror(errno));
@@ -887,7 +889,7 @@ HandleCopyStream(PGconn *conn, StreamCtl *stream,
 			/* Check the message type. */
 			if (copybuf[0] == 'k')
 			{
-				if (!ProcessKeepaliveMsg(conn, copybuf, r, blockpos,
+				if (!ProcessKeepaliveMsg(conn, stream, copybuf, r, blockpos,
 										 &last_status))
 					goto error;
 			}
@@ -1040,7 +1042,7 @@ CopyStreamReceive(PGconn *conn, long timeout, char **buffer)
  * Process the keepalive message.
  */
 static bool
-ProcessKeepaliveMsg(PGconn *conn, char *copybuf, int len,
+ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 					XLogRecPtr blockpos, int64 *last_status)
 {
 	int			pos;
@@ -1076,7 +1078,7 @@ ProcessKeepaliveMsg(PGconn *conn, char *copybuf, int len,
 			 * data has been successfully replicated or not, at the normal
 			 * shutdown of the server.
 			 */
-			if (fsync(walfile) != 0)
+			if (stream->do_sync && fsync(walfile) != 0)
 			{
 				fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
 						progname, current_walfile_name, strerror(errno));
diff --git a/src/bin/pg_basebackup/receivelog.h b/src/bin/pg_basebackup/receivelog.h
index 554ff8b..7a3bbc5 100644
--- a/src/bin/pg_basebackup/receivelog.h
+++ b/src/bin/pg_basebackup/receivelog.h
@@ -34,8 +34,10 @@ typedef struct StreamCtl
 								 * timeline */
 	int			standby_message_timeout;		/* Send status messages this
 												 * often */
-	bool		synchronous;	/* Flush data on write */
+	bool		synchronous;	/* Flush immediately WAL data on write */
 	bool		mark_done;		/* Mark segment as done in generated archive */
+	bool		do_sync;		/* Flush to disk to ensure consistent state
+								 * of data */
 
 	stream_stop_callback stream_stop;	/* Stop streaming when returns true */
 
-- 
2.10.0

0004-Switch-pg_basebackup-commands-in-Postgres.pm-to-use-.patchapplication/x-patch; name=0004-Switch-pg_basebackup-commands-in-Postgres.pm-to-use-.patchDownload
From a07ec1fbf9d6d3f5b935ed8f3974c01bb1bbd166 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Sat, 24 Sep 2016 00:13:48 +0900
Subject: [PATCH 4/4] Switch pg_basebackup commands in Postgres.pm to use
 --no-sync

On low-spec machines, this greatly reduces the I/O pressure induced by the
tests.
---
 src/test/perl/PostgresNode.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index fede1e6..15d065b 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -476,7 +476,7 @@ sub backup
 
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-p', $port,
-		'-x');
+		'-x', '--nosync');
 	print "# Backup finished\n";
 }
 
-- 
2.10.0

#22Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#21)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On 9/23/16 11:15 AM, Michael Paquier wrote:

0002:

durable_rename needs close(fd) in non-error path (compare backend code).

Oops, leak.

Why did you remove the errno save and restore?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Eisentraut (#22)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Tue, Sep 27, 2016 at 11:16 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 9/23/16 11:15 AM, Michael Paquier wrote:

0002:

durable_rename needs close(fd) in non-error path (compare backend code).

Oops, leak.

Why did you remove the errno save and restore?

That's this bit in durable_rename, patch 0002:
+        if (fsync(fd) != 0)
+        {
+            fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+                    progname, newfile, strerror(errno));
+            close(fd);
+            return -1;
+        }
+        close(fd);
I thought that as long as the error string is shown to the user, it
does not matter much if errno is still saved or not. All the callers
of durable_rename() on frontends don't check for strerrno(errno)
afterwards. Do you think it matters? Changing that back is trivial.
Sorry for not mentioning directly in the thread that I modified that
when dropping the last patch set.
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Michael Paquier (#23)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On 9/26/16 10:34 PM, Michael Paquier wrote:

I thought that as long as the error string is shown to the user, it
does not matter much if errno is still saved or not. All the callers
of durable_rename() on frontends don't check for strerrno(errno)
afterwards. Do you think it matters? Changing that back is trivial.
Sorry for not mentioning directly in the thread that I modified that
when dropping the last patch set.

Actually, I think the equivalent backend code only does this to save the
errno across the close call because the elog call comes after the close.
So it's OK to not do that in the frontend.

With that in mind, I have committed the v3 series now.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Jeff Janes
jeff.janes@gmail.com
In reply to: Peter Eisentraut (#24)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Thu, Sep 29, 2016 at 8:33 AM, Peter Eisentraut <
peter.eisentraut@2ndquadrant.com> wrote:

On 9/26/16 10:34 PM, Michael Paquier wrote:

I thought that as long as the error string is shown to the user, it
does not matter much if errno is still saved or not. All the callers
of durable_rename() on frontends don't check for strerrno(errno)
afterwards. Do you think it matters? Changing that back is trivial.
Sorry for not mentioning directly in the thread that I modified that
when dropping the last patch set.

Actually, I think the equivalent backend code only does this to save the
errno across the close call because the elog call comes after the close.
So it's OK to not do that in the frontend.

With that in mind, I have committed the v3 series now.

I'm getting compiler warnings:

file_utils.c: In function 'fsync_pgdata':
file_utils.c:86: warning: passing argument 2 of 'walkdir' from incompatible
pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char
*)' but argument is of type 'void (*)(const char *, bool, const char *)'
file_utils.c:88: warning: passing argument 2 of 'walkdir' from incompatible
pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char
*)' but argument is of type 'void (*)(const char *, bool, const char *)'
file_utils.c:89: warning: passing argument 2 of 'walkdir' from incompatible
pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char
*)' but argument is of type 'void (*)(const char *, bool, const char *)'

Cheers,

Jeff

#26Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Jeff Janes (#25)
Re: pg_basebackup, pg_receivexlog and data durability

On 9/29/16 12:31 PM, Jeff Janes wrote:

With that in mind, I have committed the v3 series now.

I'm getting compiler warnings:

Fixed.

file_utils.c: In function 'fsync_pgdata':
file_utils.c:86: warning: passing argument 2 of 'walkdir' from
incompatible pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char
*)' but argument is of type 'void (*)(const char *, bool, const char *)'
file_utils.c:88: warning: passing argument 2 of 'walkdir' from
incompatible pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char
*)' but argument is of type 'void (*)(const char *, bool, const char *)'
file_utils.c:89: warning: passing argument 2 of 'walkdir' from
incompatible pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char
*)' but argument is of type 'void (*)(const char *, bool, const char *)'

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27Michael Paquier
michael.paquier@gmail.com
In reply to: Jeff Janes (#25)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

On Fri, Sep 30, 2016 at 1:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Sep 29, 2016 at 8:33 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 9/26/16 10:34 PM, Michael Paquier wrote:

I thought that as long as the error string is shown to the user, it
does not matter much if errno is still saved or not. All the callers
of durable_rename() on frontends don't check for strerrno(errno)
afterwards. Do you think it matters? Changing that back is trivial.
Sorry for not mentioning directly in the thread that I modified that
when dropping the last patch set.

Actually, I think the equivalent backend code only does this to save the
errno across the close call because the elog call comes after the close.
So it's OK to not do that in the frontend.

With that in mind, I have committed the v3 series now.

Thanks!

I'm getting compiler warnings:

file_utils.c: In function 'fsync_pgdata':
file_utils.c:86: warning: passing argument 2 of 'walkdir' from incompatible
pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char *)'
but argument is of type 'void (*)(const char *, bool, const char *)'
file_utils.c:88: warning: passing argument 2 of 'walkdir' from incompatible
pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char *)'
but argument is of type 'void (*)(const char *, bool, const char *)'
file_utils.c:89: warning: passing argument 2 of 'walkdir' from incompatible
pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char *)'
but argument is of type 'void (*)(const char *, bool, const char *)'

Oops. Are you using gcc to see that? I compiled with clang and on
Windows without noticing it.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#27)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

Michael Paquier <michael.paquier@gmail.com> writes:

Oops. Are you using gcc to see that? I compiled with clang and on
Windows without noticing it.

Peter already noted that you'd only see it on platforms that define
PG_FLUSH_DATA_WORKS.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers