fdatasync performance problem with large number of DB files

Started by Michael Brownalmost 5 years ago49 messages

michael.brown@discourse.org

almost 5 years ago

I initially posted this on the pgsql-general mailing list [5]https://www.postgresql-archive.org/fdatasync-performance-problem-with-large-number-of-DB-files-td6184094.html but was
advised to also post this to the -hackers list as it deals with internals.

We've encountered a production performance problem with pg13 related to
how it fsyncs the whole data directory in certain scenarios, related to
what Paul (bcc'ed) described in a post to pgsql-hackers [1]/messages/by-id/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw@mail.gmail.com.

Background:

We've observed the full recursive fsync is triggered when

* pg_basebackup receives a streaming backup (via [2]https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181 fsync_dir_recurse
or fsync_pgdata) unless --no-sync is specified
* postgres starts up unclean (via [3]https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495 SyncDataDirectory)

We run multiple postgres clusters and some of those clusters have many
(~450) databases (one database-per-customer) meaning that the postgres
data directory has around 700,000 files.

On one of our less loaded servers this takes ~7 minutes to complete, but
on another [4]It should be identical config-wise. It isn't starved for IO but does have other regular write workloads this takes ~90 minutes.

Obviously this is untenable risk. We've modified our process that
bootstraps a replica via pg_basebackup to instead do "pg_basebackup
--no-sync…" followed by a "sync", but we don't have any way to do the
equivalent for the postgres startup.

I presume the reason postgres doesn't blindly run a sync() is that we
don't know what other I/O is on the system and it'd be rude to affect
other services. That makes sense, except for our environment the work
done by the recursive fsync is orders of magnitude more disruptive than
a sync().

My questions are:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

Thanks!

[1]: /messages/by-id/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw@mail.gmail.com
/messages/by-id/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw@mail.gmail.com
[2]: https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181
https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181
[3]: https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495
https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495
[4]: It should be identical config-wise. It isn't starved for IO but does have other regular write workloads
does have other regular write workloads
[5]: https://www.postgresql-archive.org/fdatasync-performance-problem-with-large-number-of-DB-files-td6184094.html
https://www.postgresql-archive.org/fdatasync-performance-problem-with-large-number-of-DB-files-td6184094.html

--
Michael Brown
Civilized Discourse Construction Kit, Inc.
https://www.discourse.org/

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Michael Brown (#1)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
<michael.brown@discourse.org> wrote:

* pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse
or fsync_pgdata) unless --no-sync is specified
* postgres starts up unclean (via [3] SyncDataDirectory)

We run multiple postgres clusters and some of those clusters have many
(~450) databases (one database-per-customer) meaning that the postgres
data directory has around 700,000 files.

On one of our less loaded servers this takes ~7 minutes to complete, but
on another [4] this takes ~90 minutes.

Ouch.

My questions are:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors). syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that. I mean, I even posted
one[1]/messages/by-id/CA+hUKGKT6XiPiEJrqeOFGi7RYCGzbBysF9pyWwv0-jm-oNajxg@mail.gmail.com in that other thread. There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).

I also wrote a WAL-and-checkpoint based prototype[2]/messages/by-id/CA+hUKGKHhDNnN6fxf6qrAx9h+mjdNU2Zmx7ztJzFQ0C5=u3QPg@mail.gmail.com, which, among
other advantages such as being faster, not ignoring errors and not
triggering collateral write-back storms, happens to work on all
operating systems. On the other hand it requires a somewhat dogmatic
switch in thinking about the meaning of checkpoints (I mean, it
requires humans to promise not to falsify checkpoints by copying
databases around underneath us), which may be hard to sell (I didn't
try very hard), and there may be subtleties I have missed...

[1]: /messages/by-id/CA+hUKGKT6XiPiEJrqeOFGi7RYCGzbBysF9pyWwv0-jm-oNajxg@mail.gmail.com
[2]: /messages/by-id/CA+hUKGKHhDNnN6fxf6qrAx9h+mjdNU2Zmx7ztJzFQ0C5=u3QPg@mail.gmail.com

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#2)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
<michael.brown@discourse.org> wrote:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors). syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that. I mean, I even posted
one[1] in that other thread. There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).

Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
I've run into a couple of users who have just commented that recursive
fsync() code out!

I'd probably make it an enum-style GUC, because I intend to do some
more work on my "precise" alternative, though not in time for this
release, and it could just as well be an option too.

Tom Lane

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Thomas Munro (#3)

Re: fdatasync performance problem with large number of DB files

Thomas Munro <thomas.munro@gmail.com> writes:

Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
I've run into a couple of users who have just commented that recursive
fsync() code out!

I'm a little skeptical about the "simple" part. At minimum, you'd
have to syncfs() each tablespace, since we have no easy way to tell
which of them are on different filesystems. (Although, if we're
presuming this is Linux-only, we might be able to tell with some
unportable check or other.)

regards, tom lane

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Tom Lane (#4)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.
I've run into a couple of users who have just commented that recursive
fsync() code out!

I'm a little skeptical about the "simple" part. At minimum, you'd
have to syncfs() each tablespace, since we have no easy way to tell
which of them are on different filesystems. (Although, if we're
presuming this is Linux-only, we might be able to tell with some
unportable check or other.)

Right, the patch knows about that:

+    /*
+     * On Linux, we don't have to open every single file one by one.  We can
+     * use syncfs() to sync whole filesystems.  We only expect filesystem
+     * boundaries to exist where we tolerate symlinks, namely pg_wal and the
+     * tablespaces, so we call syncfs() for each of those directories.
+     */

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Thomas Munro (#3)

Re: fdatasync performance problem with large number of DB files

On 2021/03/11 8:30, Thomas Munro wrote:

On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
<michael.brown@discourse.org> wrote:

* is there a knob missing we can configure?
* can we get an opt-in knob to use a single sync() call instead of a
recursive fsync()?
* would you be open to merging a patch providing said knob?
* is there something else we missed?

As discussed on that other thread, I don't think sync() is an option
(it doesn't wait on all OSes or in the standard and it doesn't report
errors). syncfs() on Linux 5.8+ looks like a good candidate though,
and I think we'd consider a patch like that. I mean, I even posted
one[1] in that other thread. There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).

Thinking about this some more, if you were to propose a patch like
that syncfs() one but make it a configurable option, I'd personally be
in favour of trying to squeeze it into v14. Others might object on
commitfest procedural grounds, I dunno, but I think this is a real
operational issue and that's a fairly simple and localised change.

+1 to push this kind of change into v14!!

I've run into a couple of users who have just commented that recursive
fsync() code out!

BTW, we can skip that recursive fsync() by disabling fsync GUC even without
commenting out the code?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Fujii Masao (#6)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 2:00 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/11 8:30, Thomas Munro wrote:

I've run into a couple of users who have just commented that recursive
fsync() code out!

BTW, we can skip that recursive fsync() by disabling fsync GUC even without
commenting out the code?

Those users wanted fsync=on because they wanted to recover to a normal
online system after a crash, but they believed that the preceding
fsync of the data directory was useless, because replaying the WAL
should be enough. IMHO they were nearly on the right track, and the
prototype patch I linked earlier as [2] was my attempt to find the
specific reasons why that doesn't work and fix them. So far, I
figured out that you still have to remember to fsync the WAL files
(otherwise you're replaying WAL that potentially hasn't reached the
disk), and data files holding blocks that recovery decided to skip due
to BLK_DONE (otherwise you might decide to skip replay because of a
higher LSN that is on a page that is in the kernel's cache but not yet
on disk).

Tom Lane

tgl@sss.pgh.pa.us

almost 5 years ago

In reply to: Thomas Munro (#5)

Re: fdatasync performance problem with large number of DB files

Thomas Munro <thomas.munro@gmail.com> writes:

On Thu, Mar 11, 2021 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm a little skeptical about the "simple" part. At minimum, you'd
have to syncfs() each tablespace, since we have no easy way to tell
which of them are on different filesystems. (Although, if we're
presuming this is Linux-only, we might be able to tell with some
unportable check or other.)

Right, the patch knows about that:

I noticed that the syncfs man page present in RHEL8 seemed a little
squishy on the critical question of error reporting. It promises
that syncfs will wait for I/O completion, but it doesn't say in so
many words that I/O errors will be reported (and the list of
applicable errno codes is only EBADF, not very reassuring).

Trolling the net, I found a newer-looking version of the man page,
and behold it says

In mainline kernel versions prior to 5.8, syncfs() will fail only
when passed a bad file descriptor (EBADF). Since Linux 5.8,
syncfs() will also report an error if one or more inodes failed
to be written back since the last syncfs() call.

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.

regards, tom lane

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Tom Lane (#8)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 2:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Trolling the net, I found a newer-looking version of the man page,
and behold it says

In mainline kernel versions prior to 5.8, syncfs() will fail only
when passed a bad file descriptor (EBADF). Since Linux 5.8,
syncfs() will also report an error if one or more inodes failed
to be written back since the last syncfs() call.

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.

Agreed. Perhaps we could also try to do something programmatic about that.

Its fsync() was also pretty rough for the first 28 years.

#10

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#9)

1 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 11, 2021 at 2:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Mar 11, 2021 at 2:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Trolling the net, I found a newer-looking version of the man page,
and behold it says

In mainline kernel versions prior to 5.8, syncfs() will fail only
when passed a bad file descriptor (EBADF). Since Linux 5.8,
syncfs() will also report an error if one or more inodes failed
to be written back since the last syncfs() call.

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.

Agreed. Perhaps we could also try to do something programmatic about that.

Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.

I would plan to extend that to include a third option as already
discussed in the other thread, maybe something like "wal" (= sync WAL
files and then do extra analysis of WAL data to sync only data
modified since checkpoint but not replayed), but that'd be material
for PG15.

Attachments:

v2-0001-Optionally-use-syncfs-for-SyncDataDirectory-on-Li.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Optionally-use-syncfs-for-SyncDataDirectory-on-Li.patchDownload

From e1a14240a08b802b669c7a4d2a7cacc1004a7de4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 27 Sep 2020 17:22:10 +1300
Subject: [PATCH v2] Optionally use syncfs() for SyncDataDirectory() on Linux.

Opening every file can be slow, as the first step before crash recovery
can begin.  Provide an alternative method, where we call syncfs() on
every possibly different filesystem under the data directory.  This
avoids faulting in inodes for potentially very many inodes.

The option can be controlled with a new GUC:

sync_after_crash={fsync,syncfs}

Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 configure                                     |  2 +-
 configure.ac                                  |  1 +
 doc/src/sgml/config.sgml                      | 28 +++++++++
 src/backend/storage/file/fd.c                 | 62 ++++++++++++++++++-
 src/backend/utils/misc/guc.c                  | 17 +++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/pg_config.h.in                    |  3 +
 src/include/storage/fd.h                      |  8 +++
 8 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 3fd4cecbeb..6a2051da9d 100755
--- a/configure
+++ b/configure
@@ -15239,7 +15239,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2f1585adc0..dd819ab2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strchrnul
 	strsignal
 	symlink
+	syncfs
 	sync_file_range
 	uselocale
 	wcstombs_l
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a218d78bef..f8bd82a729 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9679,6 +9679,34 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-sync-after-crash" xreflabel="sync_after_crash">
+      <term><varname>sync_after_crash</varname> (<type>enum</type>)
+       <indexterm>
+        <primary><varname>sync_after_crash</varname> configuration parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.  This
+        is intended to make sure that all WAL and data files are durably stored
+        on disk before replaying changes.
+       </para>
+       <para>
+        On Linux, <literal>syncfs</literal> may be used instead, to ask the
+        operating system to flush the whole file system.  This may be a lot
+        faster, because it doesn't need to open each file one by one.  On the
+        other hand, it may be slower if the file system is shared by other
+        applications that modify a lot of files, since those files will also
+        be flushed to disk.  Furthermore, on version of Linux before 5.8, I/O
+        errors encountered while writing data to disk may not be reported to
+        <productname>PostgreSQL</productname>, and relevant error messages may
+        appear only in kernel logs.
+       </para>
+      </listitem>
+     </varlistentry>
+
     </variablelist>
 
    </sect1>
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b58502837a..74554c8617 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -72,9 +72,11 @@
 
 #include "postgres.h"
 
+#include <dirent.h>
 #include <sys/file.h>
 #include <sys/param.h>
 #include <sys/stat.h>
+#include <sys/types.h>
 #ifndef WIN32
 #include <sys/mman.h>
 #endif
@@ -158,6 +160,9 @@ int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
+/* How SyncDataDirectory should do its job. */
+int			sync_after_crash = SYNC_AFTER_CRASH_FSYNC;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
@@ -3263,9 +3268,27 @@ looks_like_temp_rel_name(const char *name)
 	return true;
 }
 
+#ifdef HAVE_SYNCFS
+static void
+do_syncfs(const char *path)
+{
+	int		fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+	{
+		elog(LOG, "could not open %s: %m", path);
+		return;
+	}
+	if (syncfs(fd) < 0)
+		elog(LOG, "syncfs failed for %s: %m", path);
+	close(fd);
+}
+#endif
 
 /*
- * Issue fsync recursively on PGDATA and all its contents.
+ * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
+ * all potential filesystem, depending on sync_after_crash setting.
  *
  * We fsync regular files and directories wherever they are, but we
  * follow symlinks only for pg_wal and immediately under pg_tblspc.
@@ -3289,6 +3312,10 @@ void
 SyncDataDirectory(void)
 {
 	bool		xlog_is_symlink;
+#ifdef HAVE_SYNCFS
+	DIR		   *dir;
+	struct dirent *de;
+#endif
 
 	/* We can skip this whole thing if fsync is disabled. */
 	if (!enableFsync)
@@ -3317,6 +3344,39 @@ SyncDataDirectory(void)
 		xlog_is_symlink = true;
 #endif
 
+#ifdef HAVE_SYNCFS
+	if (sync_after_crash == SYNC_AFTER_CRASH_SYNCFS)
+	{
+		/*
+		 * On Linux, we don't have to open every single file one by one.  We
+		 * can use syncfs() to sync whole filesystems.  We only expect
+		 * filesystem boundaries to exist where we tolerate symlinks, namely
+		 * pg_wal and the tablespaces, so we call syncfs() for each of those
+		 * directories.
+		 */
+
+		/* Sync the top level pgdata directory. */
+		do_syncfs(".");
+		/* If any tablespaces are configured, sync each of those. */
+		dir = AllocateDir("pg_tblspc");
+		while ((de = ReadDir(dir, "pg_tblspc")))
+		{
+			char		path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+				continue;
+
+			snprintf(path, MAXPGPATH, "pg_tblspc/%s", de->d_name);
+			do_syncfs(path);
+		}
+		FreeDir(dir);
+		/* If pg_wal is a symlink, process that too. */
+		if (xlog_is_symlink)
+			do_syncfs("pg_wal");
+		return;
+	}
+#endif		/* !HAVE_SYNCFS */
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.  Errors in this step are even less
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 855076b1fd..fea4ce6523 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -487,6 +487,14 @@ const struct config_enum_entry ssl_protocol_versions_info[] = {
 StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2),
 				 "array length mismatch");
 
+static struct config_enum_entry sync_after_crash_options[] = {
+	{"fsync", SYNC_AFTER_CRASH_FSYNC, false},
+#ifdef HAVE_SYNCFS
+	{"syncfs", SYNC_AFTER_CRASH_SYNCFS, false},
+#endif
+	{NULL, 0, false}
+};
+
 static struct config_enum_entry shared_memory_options[] = {
 #ifndef WIN32
 	{"sysv", SHMEM_TYPE_SYSV, false},
@@ -4840,6 +4848,15 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"sync_after_crash", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+		},
+		&sync_after_crash,
+		SYNC_AFTER_CRASH_FSYNC, sync_after_crash_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f46c2dd7a8..b9edfd4787 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -757,6 +757,7 @@
 
 #exit_on_error = off			# terminate session on any error?
 #restart_after_crash = on		# reinitialize after backend crash?
+#sync_after_crash = fsync		# fsync or syncfs (Linux only, 5.8+)
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 7a7cc21d8d..7bb5743e3d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -590,6 +590,9 @@
 /* Define to 1 if you have the `symlink' function. */
 #undef HAVE_SYMLINK
 
+/* Define to 1 if you have the `syncfs' function. */
+#undef HAVE_SYNCFS
+
 /* Define to 1 if you have the `sync_file_range' function. */
 #undef HAVE_SYNC_FILE_RANGE
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 30bf7d2193..44fc1cb2a9 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -45,6 +45,13 @@
 
 #include <dirent.h>
 
+typedef enum SyncDataDirectoryMode {
+	SYNC_AFTER_CRASH_FSYNC,
+#ifdef HAVE_SYNCFS
+	SYNC_AFTER_CRASH_SYNCFS,
+#endif
+} SyncDataDirectoryMode;
+
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 typedef int File;
@@ -53,6 +60,7 @@ typedef int File;
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
+extern int sync_after_crash;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
-- 
2.30.1

#11

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#10)

1 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.

Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos.

Attachments:

v3-0001-Optionally-use-syncfs-for-SyncDataDirectory-on-Li.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Optionally-use-syncfs-for-SyncDataDirectory-on-Li.patchDownload

From 5b4a97bd99ba0ce789770c1dea48ea5544e175c7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 27 Sep 2020 17:22:10 +1300
Subject: [PATCH v3] Optionally use syncfs() for SyncDataDirectory() on Linux.

Opening every file can be slow, as the first step before crash recovery
can begin.  Provide an alternative method, where we call syncfs() on
every possibly different filesystem under the data directory.  This
avoids faulting in inodes for potentially very many inodes.

The option can be controlled with a new GUC:

sync_after_crash={fsync,syncfs}

Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 configure                                     |  2 +-
 configure.ac                                  |  1 +
 doc/src/sgml/config.sgml                      | 28 +++++++++
 src/backend/storage/file/fd.c                 | 62 ++++++++++++++++++-
 src/backend/utils/misc/guc.c                  | 17 +++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/pg_config.h.in                    |  3 +
 src/include/storage/fd.h                      |  8 +++
 src/tools/msvc/Solution.pm                    |  1 +
 9 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 3fd4cecbeb..6a2051da9d 100755
--- a/configure
+++ b/configure
@@ -15239,7 +15239,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2f1585adc0..dd819ab2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strchrnul
 	strsignal
 	symlink
+	syncfs
 	sync_file_range
 	uselocale
 	wcstombs_l
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a218d78bef..cc3843ac81 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9679,6 +9679,34 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-sync-after-crash" xreflabel="sync_after_crash">
+      <term><varname>sync_after_crash</varname> (<type>enum</type>)
+       <indexterm>
+        <primary><varname>sync_after_crash</varname> configuration parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.  This
+        is intended to make sure that all WAL and data files are durably stored
+        on disk before replaying changes.
+       </para>
+       <para>
+        On Linux, <literal>syncfs</literal> may be used instead, to ask the
+        operating system to flush the whole file system.  This may be a lot
+        faster, because it doesn't need to open each file one by one.  On the
+        other hand, it may be slower if the file system is shared by other
+        applications that modify a lot of files, since those files will also
+        be flushed to disk.  Furthermore, on versions of Linux before 5.8, I/O
+        errors encountered while writing data to disk may not be reported to
+        <productname>PostgreSQL</productname>, and relevant error messages may
+        appear only in kernel logs.
+       </para>
+      </listitem>
+     </varlistentry>
+
     </variablelist>
 
    </sect1>
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b58502837a..461a1a5b07 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -72,9 +72,11 @@
 
 #include "postgres.h"
 
+#include <dirent.h>
 #include <sys/file.h>
 #include <sys/param.h>
 #include <sys/stat.h>
+#include <sys/types.h>
 #ifndef WIN32
 #include <sys/mman.h>
 #endif
@@ -158,6 +160,9 @@ int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
+/* How SyncDataDirectory() should do its job. */
+int			sync_after_crash = SYNC_AFTER_CRASH_FSYNC;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
@@ -3263,9 +3268,27 @@ looks_like_temp_rel_name(const char *name)
 	return true;
 }
 
+#ifdef HAVE_SYNCFS
+static void
+do_syncfs(const char *path)
+{
+	int		fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+	{
+		elog(LOG, "could not open %s: %m", path);
+		return;
+	}
+	if (syncfs(fd) < 0)
+		elog(LOG, "syncfs failed for %s: %m", path);
+	close(fd);
+}
+#endif
 
 /*
- * Issue fsync recursively on PGDATA and all its contents.
+ * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
+ * all potential filesystem, depending on sync_after_crash setting.
  *
  * We fsync regular files and directories wherever they are, but we
  * follow symlinks only for pg_wal and immediately under pg_tblspc.
@@ -3289,6 +3312,10 @@ void
 SyncDataDirectory(void)
 {
 	bool		xlog_is_symlink;
+#ifdef HAVE_SYNCFS
+	DIR		   *dir;
+	struct dirent *de;
+#endif
 
 	/* We can skip this whole thing if fsync is disabled. */
 	if (!enableFsync)
@@ -3317,6 +3344,39 @@ SyncDataDirectory(void)
 		xlog_is_symlink = true;
 #endif
 
+#ifdef HAVE_SYNCFS
+	if (sync_after_crash == SYNC_AFTER_CRASH_SYNCFS)
+	{
+		/*
+		 * On Linux, we don't have to open every single file one by one.  We
+		 * can use syncfs() to sync whole filesystems.  We only expect
+		 * filesystem boundaries to exist where we tolerate symlinks, namely
+		 * pg_wal and the tablespaces, so we call syncfs() for each of those
+		 * directories.
+		 */
+
+		/* Sync the top level pgdata directory. */
+		do_syncfs(".");
+		/* If any tablespaces are configured, sync each of those. */
+		dir = AllocateDir("pg_tblspc");
+		while ((de = ReadDir(dir, "pg_tblspc")))
+		{
+			char		path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+				continue;
+
+			snprintf(path, MAXPGPATH, "pg_tblspc/%s", de->d_name);
+			do_syncfs(path);
+		}
+		FreeDir(dir);
+		/* If pg_wal is a symlink, process that too. */
+		if (xlog_is_symlink)
+			do_syncfs("pg_wal");
+		return;
+	}
+#endif		/* !HAVE_SYNCFS */
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.  Errors in this step are even less
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 855076b1fd..fea4ce6523 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -487,6 +487,14 @@ const struct config_enum_entry ssl_protocol_versions_info[] = {
 StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2),
 				 "array length mismatch");
 
+static struct config_enum_entry sync_after_crash_options[] = {
+	{"fsync", SYNC_AFTER_CRASH_FSYNC, false},
+#ifdef HAVE_SYNCFS
+	{"syncfs", SYNC_AFTER_CRASH_SYNCFS, false},
+#endif
+	{NULL, 0, false}
+};
+
 static struct config_enum_entry shared_memory_options[] = {
 #ifndef WIN32
 	{"sysv", SHMEM_TYPE_SYSV, false},
@@ -4840,6 +4848,15 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"sync_after_crash", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+		},
+		&sync_after_crash,
+		SYNC_AFTER_CRASH_FSYNC, sync_after_crash_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f46c2dd7a8..b9edfd4787 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -757,6 +757,7 @@
 
 #exit_on_error = off			# terminate session on any error?
 #restart_after_crash = on		# reinitialize after backend crash?
+#sync_after_crash = fsync		# fsync or syncfs (Linux only, 5.8+)
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 7a7cc21d8d..7bb5743e3d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -590,6 +590,9 @@
 /* Define to 1 if you have the `symlink' function. */
 #undef HAVE_SYMLINK
 
+/* Define to 1 if you have the `syncfs' function. */
+#undef HAVE_SYNCFS
+
 /* Define to 1 if you have the `sync_file_range' function. */
 #undef HAVE_SYNC_FILE_RANGE
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 30bf7d2193..44fc1cb2a9 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -45,6 +45,13 @@
 
 #include <dirent.h>
 
+typedef enum SyncDataDirectoryMode {
+	SYNC_AFTER_CRASH_FSYNC,
+#ifdef HAVE_SYNCFS
+	SYNC_AFTER_CRASH_SYNCFS,
+#endif
+} SyncDataDirectoryMode;
+
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 typedef int File;
@@ -53,6 +60,7 @@ typedef int File;
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
+extern int sync_after_crash;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index a4f5cc4bdb..43d171e544 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -388,6 +388,7 @@ sub GenerateFiles
 		HAVE_STRUCT_TM_TM_ZONE                   => undef,
 		HAVE_SYNC_FILE_RANGE                     => undef,
 		HAVE_SYMLINK                             => 1,
+		HAVE_SYNCFS                              => undef,
 		HAVE_SYSLOG                              => undef,
 		HAVE_SYS_EPOLL_H                         => undef,
 		HAVE_SYS_EVENT_H                         => undef,
-- 
2.30.1

#12

Paul Guo

guopa@vmware.com

almost 5 years ago

In reply to: Thomas Munro (#11)

Re: fdatasync performance problem with large number of DB files

> On 2021/3/15, 7:34 AM, "Thomas Munro" <thomas.munro@gmail.com> wrote:

On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.

Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos.

By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]/messages/by-id/25CFBDF2-5551-4CC3-ADEB-434B6B1BAD16@vmware.com), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

[1]: /messages/by-id/25CFBDF2-5551-4CC3-ADEB-434B6B1BAD16@vmware.com

#13

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Paul Guo (#12)

Re: fdatasync performance problem with large number of DB files

On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote:

By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up). It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough. Hmm What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata? Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives. Or something like that...

I'm not too keen to invent clever new schemes for PG14, though. This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities. Is
that useful enough? In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

#14

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Thomas Munro (#11)

Re: fdatasync performance problem with large number of DB files

On 2021/03/15 8:33, Thomas Munro wrote:

On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Time being of the essence, here is the patch I posted last year, this
time with a GUC and some docs. You can set sync_after_crash to
"fsync" (default) or "syncfs" if you have it.

Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos.

Thanks for the patch!

+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.

Isn't this a bit misleading? This may cause users to misunderstand that
such fsync can happen only in the case of crash recovery.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#15

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Thomas Munro (#13)

Re: fdatasync performance problem with large number of DB files

On 2021/03/16 8:15, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote:

By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up). It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough. Hmm What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata? Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives. Or something like that...

I'm not too keen to invent clever new schemes for PG14, though. This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities. Is
that useful enough? In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be helpful for this case and simple.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#16

Paul Guo

paulguo@gmail.com

almost 5 years ago

In reply to: Fujii Masao (#15)

Re: fdatasync performance problem with large number of DB files

On Tue, Mar 16, 2021 at 4:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/16 8:15, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa@vmware.com> wrote:

By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
backup_label(?) does not seem to be good ideas since
1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
copy the pgdata directory and then we still need fsync.
The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up). It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough. Hmm What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...

hostname, mac address, or pgdata path (or e.g. inode of a file?) might
be the same after vm cloning or directory copying though it is not usual.
I can not figure out a stable solution that makes the information is out of
date after vm/directory cloning/copying, so the simplest way seems to
be that leaves the decision (i.e. touching a file) to users, instead of
writing the information automatically by pg_rewind/pg_basebackup.

(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata? Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives. Or something like that...

I'm not too keen to invent clever new schemes for PG14, though. This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities. Is
that useful enough? In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

This scenario seems to be a support to the file touching solution since
we do not have an automatic solution to skip the fsync. I thought using
sync_after_crash=none to fix my issue but as I said we need to reset
the guc since we still expect fsync/syncfs after the 2nd crash.

They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting

This seems to make the process complex. From the perspective of product design,
this seems to be not attractive.

sync_after_crash=none would be helpful for this case and simple.

Regards,
Paul Guo (Vmware)

#17

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Fujii Masao (#14)

Re: fdatasync performance problem with large number of DB files

On Tue, Mar 16, 2021 at 9:10 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Thanks for the patch!
+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.
Isn't this a bit misleading? This may cause users to misunderstand that
such fsync can happen only in the case of crash recovery.

If I insert the following extra sentence after that one, is it better?
"This applies whenever starting a database cluster that did not shut
down cleanly, including copies created with pg_basebackup."

#18

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Fujii Masao (#15)

Re: fdatasync performance problem with large number of DB files

On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/16 8:15, Thomas Munro wrote:

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

Hopefully syncfs() will return quickly in that case, without doing any work?

They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be helpful for this case and simple.

I still do not like this footgun :-) However, perhaps I am being
overly dogmatic. Consider the change in d8179b00, which decided that
I/O errors in this phase should be reported at LOG level rather than
ERROR. In contrast, my "sync_after_crash=wal" mode (which I need to
rebase over this) will PANIC in this case, because any syncing will be
handled through the usual checkpoint codepaths.

Do you think it would be OK to commit this feature with just "fsync"
and "syncfs", and then to continue to consider adding "none" as a
possible separate commit?

#19

Paul Guo

paulguo@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#18)

Re: fdatasync performance problem with large number of DB files

On Wed, Mar 17, 2021 at 11:45 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/16 8:15, Thomas Munro wrote:

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

Hopefully syncfs() will return quickly in that case, without doing any work?

I just quickly reviewed the patch (the code part). It looks good. Only
one concern
or question is do_syncfs() for symlink opened fd for syncfs() - I'm
not 100% sure.

I think we could consider reviewing and then pushing the syncfs patch
at this moment since
the fsync issue really affects much although there seems to be a
better plan for this in the future,
it may make the sync step in startup much faster. Today I first
encountered a real
case in a production environment. startup spends >1hour on the fsync
step: I'm pretty
sure that the pgdata is sync-ed, and per startup pstack I saw the
startup process
one by one slowly open(), fsync() (surely do nothing) and close(), and
the pre_sync_fname() is also a burden in such a scenario. So this
issue is really
annoying.

We could discuss further optimizing the special crash recovery
scenario that users
explicitly know the sync step could be skipped (this scenario is
surely not unusual),
even having the patch. The syncfs patch could be used for this
scenario also but the
filesystem might be shared by other applications (this is not unusual
and happens in my
customer's environment for example) so syncfs() is (probably much) slower than
skipping the sync step.

--
Paul Guo (Vmware)

#20

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Thomas Munro (#17)

Re: fdatasync performance problem with large number of DB files

On 2021/03/17 12:02, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 9:10 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for the patch!
+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.
Isn't this a bit misleading? This may cause users to misunderstand that
such fsync can happen only in the case of crash recovery.
If I insert the following extra sentence after that one, is it better?
"This applies whenever starting a database cluster that did not shut
down cleanly, including copies created with pg_basebackup."

Yes. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#21

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Thomas Munro (#18)

Re: fdatasync performance problem with large number of DB files

On 2021/03/17 12:45, Thomas Munro wrote:

On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/16 8:15, Thomas Munro wrote:

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want the option to disable such
startup sync (i.e., sync_after_crash=none) because it's not necessary.

Hopefully syncfs() will return quickly in that case, without doing any work?

Yes, in Linux.

They can skip that sync by fsync=off. But if they just want to skip only that
startup sync and make subsequent recovery (or standby server) work with
fsync=on, they would need to shutdown the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be helpful for this case and simple.

I still do not like this footgun :-) However, perhaps I am being
overly dogmatic. Consider the change in d8179b00, which decided that
I/O errors in this phase should be reported at LOG level rather than
ERROR. In contrast, my "sync_after_crash=wal" mode (which I need to
rebase over this) will PANIC in this case, because any syncing will be
handled through the usual checkpoint codepaths.

Do you think it would be OK to commit this feature with just "fsync"
and "syncfs", and then to continue to consider adding "none" as a
possible separate commit?

+1. "syncfs" feature is useful whether we also support "none" mode or not.
It's good idea to commit "syncfs" in advance.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#22

Paul Guo

guopa@vmware.com

almost 5 years ago

In reply to: Fujii Masao (#21)

Re: fdatasync performance problem with large number of DB files

About the syncfs patch, my first impression on the guc name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you guys think
It is better to rename it to a clearer name like sync_method_after_crash or others?

#23

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Paul Guo (#22)

1 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:

About the syncfs patch, my first impression on the guc name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you guys think
It is better to rename it to a clearer name like sync_method_after_crash or others?

Works for me. Here is a new version like that, also including the
documentation change discussed with Fujii-san, and a couple of
cosmetic changes.

Attachments:

v3-0001-Optionally-use-syncfs-for-SyncDataDirectory-on-Li.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Optionally-use-syncfs-for-SyncDataDirectory-on-Li.patchDownload

From 68491a974fe9690b2c616d965066266ec6727e7d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 27 Sep 2020 17:22:10 +1300
Subject: [PATCH v3] Optionally use syncfs() for SyncDataDirectory() on Linux.

On some systems, opening every file in the data directory can be very
slow, as the first step before crash recovery can begin.  Provide an
alternative method, where we call syncfs() on every possibly different
filesystem under the data directory.  This avoids faulting in inodes for
potentially very many inodes.  It comes with some caveats, so it's not
the default.

The option can be controlled with a new setting:

sync_method_after_crash={fsync,syncfs}

Reported-by: Michael Brown <michael.brown@discourse.org>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Paul Guo <guopa@vmware.com>
Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 configure                                     |  2 +-
 configure.ac                                  |  1 +
 doc/src/sgml/config.sgml                      | 30 +++++++++
 src/backend/storage/file/fd.c                 | 61 ++++++++++++++++++-
 src/backend/utils/misc/guc.c                  | 17 ++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/pg_config.h.in                    |  3 +
 src/include/storage/fd.h                      |  6 ++
 src/tools/msvc/Solution.pm                    |  1 +
 9 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 3fd4cecbeb..6a2051da9d 100755
--- a/configure
+++ b/configure
@@ -15239,7 +15239,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2f1585adc0..dd819ab2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strchrnul
 	strsignal
 	symlink
+	syncfs
 	sync_file_range
 	uselocale
 	wcstombs_l
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8603cf3f94..f11af3b763 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9704,6 +9704,36 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-sync-method-after-crash" xreflabel="sync_method_after_crash">
+      <term><varname>sync_method_after_crash</varname> (<type>enum</type>)
+       <indexterm>
+        <primary><varname>sync_method_after_crash</varname> configuration parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.  This
+        is intended to make sure that all WAL and data files are durably stored
+        on disk before replaying changes.  This applies whenever starting a
+        database cluster that did not shut down cleanly, including copies
+        created with pg_basebackup.
+       </para>
+       <para>
+        On Linux, <literal>syncfs</literal> may be used instead, to ask the
+        operating system to flush the whole file system.  This may be a lot
+        faster, because it doesn't need to open each file one by one.  On the
+        other hand, it may be slower if the file system is shared by other
+        applications that modify a lot of files, since those files will also
+        be flushed to disk.  Furthermore, on versions of Linux before 5.8, I/O
+        errors encountered while writing data to disk may not be reported to
+        <productname>PostgreSQL</productname>, and relevant error messages may
+        appear only in kernel logs.
+       </para>
+      </listitem>
+     </varlistentry>
+
     </variablelist>
 
    </sect1>
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b58502837a..f83a74e969 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -72,9 +72,11 @@
 
 #include "postgres.h"
 
+#include <dirent.h>
 #include <sys/file.h>
 #include <sys/param.h>
 #include <sys/stat.h>
+#include <sys/types.h>
 #ifndef WIN32
 #include <sys/mman.h>
 #endif
@@ -158,6 +160,9 @@ int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
+/* How SyncDataDirectory() should do its job. */
+int			sync_method_after_crash = SYNC_METHOD_AFTER_CRASH_FSYNC;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
@@ -3263,9 +3268,27 @@ looks_like_temp_rel_name(const char *name)
 	return true;
 }
 
+#ifdef HAVE_SYNCFS
+static void
+do_syncfs(const char *path)
+{
+	int		fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+	{
+		elog(LOG, "could not open %s: %m", path);
+		return;
+	}
+	if (syncfs(fd) < 0)
+		elog(LOG, "syncfs failed for %s: %m", path);
+	close(fd);
+}
+#endif
 
 /*
- * Issue fsync recursively on PGDATA and all its contents.
+ * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
+ * all potential filesystem, depending on sync_method_after_crash setting.
  *
  * We fsync regular files and directories wherever they are, but we
  * follow symlinks only for pg_wal and immediately under pg_tblspc.
@@ -3317,6 +3340,42 @@ SyncDataDirectory(void)
 		xlog_is_symlink = true;
 #endif
 
+#ifdef HAVE_SYNCFS
+	if (sync_method_after_crash == SYNC_METHOD_AFTER_CRASH_SYNCFS)
+	{
+		DIR		   *dir;
+		struct dirent *de;
+
+		/*
+		 * On Linux, we don't have to open every single file one by one.  We
+		 * can use syncfs() to sync whole filesystems.  We only expect
+		 * filesystem boundaries to exist where we tolerate symlinks, namely
+		 * pg_wal and the tablespaces, so we call syncfs() for each of those
+		 * directories.
+		 */
+
+		/* Sync the top level pgdata directory. */
+		do_syncfs(".");
+		/* If any tablespaces are configured, sync each of those. */
+		dir = AllocateDir("pg_tblspc");
+		while ((de = ReadDir(dir, "pg_tblspc")))
+		{
+			char		path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+				continue;
+
+			snprintf(path, MAXPGPATH, "pg_tblspc/%s", de->d_name);
+			do_syncfs(path);
+		}
+		FreeDir(dir);
+		/* If pg_wal is a symlink, process that too. */
+		if (xlog_is_symlink)
+			do_syncfs("pg_wal");
+		return;
+	}
+#endif		/* !HAVE_SYNCFS */
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.  Errors in this step are even less
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b263e3493b..70d7a2741f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -487,6 +487,14 @@ const struct config_enum_entry ssl_protocol_versions_info[] = {
 StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2),
 				 "array length mismatch");
 
+static struct config_enum_entry sync_method_after_crash_options[] = {
+	{"fsync", SYNC_METHOD_AFTER_CRASH_FSYNC, false},
+#ifdef HAVE_SYNCFS
+	{"syncfs", SYNC_METHOD_AFTER_CRASH_SYNCFS, false},
+#endif
+	{NULL, 0, false}
+};
+
 static struct config_enum_entry shared_memory_options[] = {
 #ifndef WIN32
 	{"sysv", SHMEM_TYPE_SYSV, false},
@@ -4850,6 +4858,15 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"sync_method_after_crash", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+		},
+		&sync_method_after_crash,
+		SYNC_METHOD_AFTER_CRASH_FSYNC, sync_method_after_crash_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6647f8fd6e..747f29393a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -758,6 +758,7 @@
 
 #exit_on_error = off			# terminate session on any error?
 #restart_after_crash = on		# reinitialize after backend crash?
+#sync_method_after_crash = fsync	# fsync or syncfs (Linux only, 5.8+)
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 7a7cc21d8d..7bb5743e3d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -590,6 +590,9 @@
 /* Define to 1 if you have the `symlink' function. */
 #undef HAVE_SYMLINK
 
+/* Define to 1 if you have the `syncfs' function. */
+#undef HAVE_SYNCFS
+
 /* Define to 1 if you have the `sync_file_range' function. */
 #undef HAVE_SYNC_FILE_RANGE
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 30bf7d2193..ee26915c06 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -45,6 +45,11 @@
 
 #include <dirent.h>
 
+typedef enum SyncMethodAfterCrash {
+	SYNC_METHOD_AFTER_CRASH_FSYNC,
+	SYNC_METHOD_AFTER_CRASH_SYNCFS
+} SyncMethodAfterCrash;
+
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 typedef int File;
@@ -53,6 +58,7 @@ typedef int File;
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
+extern int sync_method_after_crash;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index a4f5cc4bdb..43d171e544 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -388,6 +388,7 @@ sub GenerateFiles
 		HAVE_STRUCT_TM_TM_ZONE                   => undef,
 		HAVE_SYNC_FILE_RANGE                     => undef,
 		HAVE_SYMLINK                             => 1,
+		HAVE_SYNCFS                              => undef,
 		HAVE_SYSLOG                              => undef,
 		HAVE_SYS_EPOLL_H                         => undef,
 		HAVE_SYS_EVENT_H                         => undef,
-- 
2.30.1

#24

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Paul Guo (#19)

Re: fdatasync performance problem with large number of DB files

On Wed, Mar 17, 2021 at 11:42 PM Paul Guo <paulguo@gmail.com> wrote:

I just quickly reviewed the patch (the code part). It looks good. Only
one concern
or question is do_syncfs() for symlink opened fd for syncfs() - I'm
not 100% sure.

Alright, let me try to prove that it works the way we want with an experiment.

I'll make a directory with a file in it, and create a symlink to it in
another filesystem:

tmunro@x1:~/junk$ mkdir my_wal_dir
tmunro@x1:~/junk$ touch my_wal_dir/foo
tmunro@x1:~/junk$ ln -s /home/tmunro/junk/my_wal_dir /dev/shm/my_wal_dir_symlink
tmunro@x1:~/junk$ ls /dev/shm/my_wal_dir_symlink/
foo

Now I'll write a program that repeatedly dirties the first block of
foo, and calls syncfs() on the containing directory that it opened
using the symlink:

tmunro@x1:~/junk$ cat test.c
#define _GNU_SOURCE

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main()
{
int symlink_fd, file_fd;

symlink_fd = open("/dev/shm/my_wal_dir_symlink", O_RDONLY);
if (symlink_fd < 0) {
perror("open1");
return EXIT_FAILURE;
}

file_fd = open("/home/tmunro/junk/my_wal_dir/foo", O_RDWR);
if (file_fd < 0) {
perror("open2");
return EXIT_FAILURE;
}

for (int i = 0; i < 4; ++i) {
if (pwrite(file_fd, "hello world", 10, 0) != 10) {
perror("pwrite");
return EXIT_FAILURE;
}
if (syncfs(symlink_fd) < 0) {
perror("syncfs");
return EXIT_FAILURE;
}
sleep(1);
}
return EXIT_SUCCESS;
}
tmunro@x1:~/junk$ cc test.c
tmunro@x1:~/junk$ ./a.out

While that's running, to prove that it does what we want it to do,
I'll first find out where foo lives on the disk:

tmunro@x1:~/junk$ /sbin/xfs_bmap my_wal_dir/foo
my_wal_dir/foo:
0: [0..7]: 242968520..242968527

Now I'll trace the writes going to block 242968520, and start the program again:

tmunro@x1:~/junk$ sudo btrace /dev/nvme0n1p2 | grep 242968520
259,0 4 93 4.157000669 724924 A W 244019144 + 8 <-
(259,2) 242968520
259,0 2 155 5.158446989 718635 A W 244019144 + 8 <-
(259,2) 242968520
259,0 7 23 6.163765728 724924 A W 244019144 + 8 <-
(259,2) 242968520
259,0 7 30 7.169112683 724924 A W 244019144 + 8 <-
(259,2) 242968520

#25

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Thomas Munro (#23)

Re: fdatasync performance problem with large number of DB files

On 2021/03/18 19:19, Thomas Munro wrote:

On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:

About the syncfs patch, my first impression on the guc name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you guys think
It is better to rename it to a clearer name like sync_method_after_crash or others?

Works for me. Here is a new version like that, also including the
documentation change discussed with Fujii-san, and a couple of
cosmetic changes.

Thanks for updating the patch!

+        database cluster that did not shut down cleanly, including copies
+        created with pg_basebackup.

"pg_basebackup" should be "<application>pg_basebackup</application>"?

+ while ((de = ReadDir(dir, "pg_tblspc")))

The comment of SyncDataDirectory() says "Errors are logged but not
considered fatal". So ReadDirExtended() with LOG level should be used
here, instead?

Isn't it better to call CHECK_FOR_INTERRUPTS() in this loop?

+ fd = open(path, O_RDONLY);

For current use, it's not harmful to use open() and close(). But isn't
it safer to use OpenTransientFile() and CloseTransientFile(), instead?
Because do_syncfs() may be used for other purpose in the future.

+	if (syncfs(fd) < 0)
+		elog(LOG, "syncfs failed for %s: %m", path);

According to the message style guide, this message should be something
like "could not sync filesystem for \"%s\": %m"?

I confirmed that no error was reported when crash recovery started with
syncfs, in old Linux. I should also confirm that no error is reported in that
case in Linux 5.8+, but I don't have that environement. So I've not tested
this feature in Linux 5.8+....

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#26

Bruce Momjian

bruce@momjian.us

almost 5 years ago

In reply to: Thomas Munro (#23)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 18, 2021 at 11:19:13PM +1300, Thomas Munro wrote:

On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:

About the syncfs patch, my first impression on the guc name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you guys think
It is better to rename it to a clearer name like sync_method_after_crash or others?

Works for me. Here is a new version like that, also including the
documentation change discussed with Fujii-san, and a couple of
cosmetic changes.

Are we sure we want to use the word "crash" here? I don't remember
seeing it used anywhere else in our user interface. I guess it is
"crash recovery".

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#27

Bruce Momjian

bruce@momjian.us

almost 5 years ago

In reply to: Bruce Momjian (#26)

Re: fdatasync performance problem with large number of DB files

On Thu, Mar 18, 2021 at 09:54:11AM -0400, Bruce Momjian wrote:

On Thu, Mar 18, 2021 at 11:19:13PM +1300, Thomas Munro wrote:

On Thu, Mar 18, 2021 at 8:52 PM Paul Guo <guopa@vmware.com> wrote:

About the syncfs patch, my first impression on the guc name sync_after_crash
is that it is a boolean type. Not sure about other people's feeling. Do you guys think
It is better to rename it to a clearer name like sync_method_after_crash or others?

Works for me. Here is a new version like that, also including the
documentation change discussed with Fujii-san, and a couple of
cosmetic changes.

Are we sure we want to use the word "crash" here? I don't remember
seeing it used anywhere else in our user interface. I guess it is
"crash recovery".

Maybe call it "recovery_sync_method"?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#28

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Bruce Momjian (#27)

Re: fdatasync performance problem with large number of DB files

On 2021/03/18 23:03, Bruce Momjian wrote:

Are we sure we want to use the word "crash" here? I don't remember
seeing it used anywhere else in our user interface.

We have GUC restart_after_crash.

I guess it is

"crash recovery".

Maybe call it "recovery_sync_method"

+1. This name sounds good to me. Or recovery_init_sync_method, because that
sync happens in the initial phase of recovery.

Another idea from different angle is data_directory_sync_method. If we adopt
this, we can easily extend this feature for other use cases (other than sync at
the beginning of recovery) without changing the name.
I'm not sure if such cases actually exist, though.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#29

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Fujii Masao (#28)

3 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Fri, Mar 19, 2021 at 5:50 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2021/03/18 23:03, Bruce Momjian wrote:

Are we sure we want to use the word "crash" here? I don't remember
seeing it used anywhere else in our user interface.

We have GUC restart_after_crash.

I guess it is

"crash recovery".

Maybe call it "recovery_sync_method"

+1. This name sounds good to me. Or recovery_init_sync_method, because that
sync happens in the initial phase of recovery.

Yeah, I was trying to fit the existing pattern
{restart,remove_temp_files}_after_crash. But
recovery_init_sync_method also sounds good to me. I prefer the
version with "init"... without "init", people might get the wrong idea
about what this controls, so let's try that. Done in the attached
version.

Another idea from different angle is data_directory_sync_method. If we adopt
this, we can easily extend this feature for other use cases (other than sync at
the beginning of recovery) without changing the name.
I'm not sure if such cases actually exist, though.

I can't imagine what -- it's like using a sledge hammer to crack a nut.

(I am aware of a semi-related idea: use the proposed fsinfo() Linux
system call to read the filesystem-wide error counter at every
checkpoint to see if anything bad happened that Linux forgot to tell
us about through the usual channels. That's the same internal
mechanism used by syncfs() to report errors, but last I checked it
hadn't been committed yet. I don't think that's share anything with
this code though.)

From your earlier email:

On Fri, Mar 19, 2021 at 2:12 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

+        database cluster that did not shut down cleanly, including copies
+        created with pg_basebackup.
"pg_basebackup" should be "<application>pg_basebackup</application>"?

Fixed.

+ while ((de = ReadDir(dir, "pg_tblspc")))

The comment of SyncDataDirectory() says "Errors are logged but not
considered fatal". So ReadDirExtended() with LOG level should be used
here, instead?

Fixed.

Isn't it better to call CHECK_FOR_INTERRUPTS() in this loop?

How could this be useful?

+ fd = open(path, O_RDONLY);

For current use, it's not harmful to use open() and close(). But isn't
it safer to use OpenTransientFile() and CloseTransientFile(), instead?

Ok, done, for consistency with other code.

Because do_syncfs() may be used for other purpose in the future.

I hope not :-)

+       if (syncfs(fd) < 0)
+               elog(LOG, "syncfs failed for %s: %m", path);
According to the message style guide, this message should be something
like "could not sync filesystem for \"%s\": %m"?

Fixed.

I confirmed that no error was reported when crash recovery started with
syncfs, in old Linux. I should also confirm that no error is reported in that
case in Linux 5.8+, but I don't have that environement. So I've not tested
this feature in Linux 5.8+....

I have a Linux 5.10 system. Here's a trace of SyncDataDirectory on a
system that has two tablespaces and has a symlink for pg_wal:

[pid 3861224] lstat("pg_wal", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0
[pid 3861224] openat(AT_FDCWD, ".", O_RDONLY) = 4
[pid 3861224] syncfs(4) = 0
[pid 3861224] close(4) = 0
[pid 3861224] openat(AT_FDCWD, "pg_tblspc",
O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
[pid 3861224] fstat(4, {st_mode=S_IFDIR|0700, st_size=32, ...}) = 0
[pid 3861224] getdents64(4, 0x55627e18fb60 /* 4 entries */, 32768) = 112
[pid 3861224] openat(AT_FDCWD, "pg_tblspc/16406", O_RDONLY) = 5
[pid 3861224] syncfs(5) = 0
[pid 3861224] close(5) = 0
[pid 3861224] openat(AT_FDCWD, "pg_tblspc/16407", O_RDONLY) = 5
[pid 3861224] syncfs(5) = 0
[pid 3861224] close(5) = 0
[pid 3861224] getdents64(4, 0x55627e18fb60 /* 0 entries */, 32768) = 0
[pid 3861224] close(4) = 0
[pid 3861224] openat(AT_FDCWD, "pg_wal", O_RDONLY) = 4
[pid 3861224] syncfs(4) = 0
[pid 3861224] close(4) = 0

To see how it looks when syncfs() fails, I added a fake implementation
that fails with EUCLEAN on every second call, and then the output
looks like this:

...
1616111356.663 startup 3879272 LOG: database system was interrupted;
last known up at 2021-03-19 12:48:33 NZDT
1616111356.663 startup 3879272 LOG: could not sync filesystem for
"pg_tblspc/16406": Structure needs cleaning
1616111356.663 startup 3879272 LOG: could not sync filesystem for
"pg_wal": Structure needs cleaning
1616111356.663 startup 3879272 LOG: database system was not properly
shut down; automatic recovery in progress
...

A more common setup with no tablespaces and pg_wal as a non-symlink looks like:

[pid 3861448] lstat("pg_wal", {st_mode=S_IFDIR|0700, st_size=316, ...}) = 0
[pid 3861448] openat(AT_FDCWD, ".", O_RDONLY) = 4
[pid 3861448] syncfs(4) = 0
[pid 3861448] close(4) = 0
[pid 3861448] openat(AT_FDCWD, "pg_tblspc",
O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
[pid 3861448] fstat(4, {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0
[pid 3861448] getdents64(4, 0x55764beb0b60 /* 2 entries */, 32768) = 48
[pid 3861448] getdents64(4, 0x55764beb0b60 /* 0 entries */, 32768) = 0
[pid 3861448] close(4)

The alternative fsync() mode is (unsurprisingly) much longer.

Thanks for the reviews!

PS: For illustration/discussion, I've also attached a "none" patch. I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.

Attachments:

v4-0001-Provide-recovery_init_sync_method-syncfs.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Provide-recovery_init_sync_method-syncfs.patchDownload

From a5faf47f720684bd7952ecf0ca0c495d9ec14989 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 27 Sep 2020 17:22:10 +1300
Subject: [PATCH v4 1/3] Provide recovery_init_sync_method=syncfs.

On some systems, opening every file in the data directory can be very
slow, as the first step before crash recovery can begin.  Provide an
alternative method, for Linux only, where we call syncfs() on every
possibly different filesystem under the data directory.  This avoids
faulting in inodes for potentially very many inodes.  It comes with some
caveats, so it's not the default.

Reported-by: Michael Brown <michael.brown@discourse.org>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Paul Guo <guopa@vmware.com>
Reviewed-by: Bruce Momjian <bruce@momjian.us>
Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 configure                                     |  2 +-
 configure.ac                                  |  1 +
 doc/src/sgml/config.sgml                      | 30 +++++++++
 src/backend/storage/file/fd.c                 | 61 ++++++++++++++++++-
 src/backend/utils/misc/guc.c                  | 17 ++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/pg_config.h.in                    |  3 +
 src/include/storage/fd.h                      |  6 ++
 src/tools/msvc/Solution.pm                    |  1 +
 9 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 3fd4cecbeb..6a2051da9d 100755
--- a/configure
+++ b/configure
@@ -15239,7 +15239,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2f1585adc0..dd819ab2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strchrnul
 	strsignal
 	symlink
+	syncfs
 	sync_file_range
 	uselocale
 	wcstombs_l
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 863ac31c6b..32de8235bf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9721,6 +9721,36 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-recovery-init-sync-method" xreflabel="recovery_init_sync_method">
+      <term><varname>recovery_init_sync_method</varname> (<type>enum</type>)
+       <indexterm>
+        <primary><varname>recovery_init_sync_method</varname> configuration parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.  This
+        is intended to make sure that all WAL and data files are durably stored
+        on disk before replaying changes.  This applies whenever starting a
+        database cluster that did not shut down cleanly, including copies
+        created with <application>pg_basebackup</application>.
+       </para>
+       <para>
+        On Linux, <literal>syncfs</literal> may be used instead, to ask the
+        operating system to flush the whole file system.  This may be a lot
+        faster, because it doesn't need to open each file one by one.  On the
+        other hand, it may be slower if the file system is shared by other
+        applications that modify a lot of files, since those files will also
+        be flushed to disk.  Furthermore, on versions of Linux before 5.8, I/O
+        errors encountered while writing data to disk may not be reported to
+        <productname>PostgreSQL</productname>, and relevant error messages may
+        appear only in kernel logs.
+       </para>
+      </listitem>
+     </varlistentry>
+
     </variablelist>
 
    </sect1>
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 110ba31517..14a07f09a4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -72,9 +72,11 @@
 
 #include "postgres.h"
 
+#include <dirent.h>
 #include <sys/file.h>
 #include <sys/param.h>
 #include <sys/stat.h>
+#include <sys/types.h>
 #ifndef WIN32
 #include <sys/mman.h>
 #endif
@@ -158,6 +160,9 @@ int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
+/* How SyncDataDirectory() should do its job. */
+int			recovery_init_sync_method = RECOVERY_INIT_SYNC_METHOD_FSYNC;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
@@ -3265,9 +3270,27 @@ looks_like_temp_rel_name(const char *name)
 	return true;
 }
 
+#ifdef HAVE_SYNCFS
+static void
+do_syncfs(const char *path)
+{
+	int		fd;
+
+	fd = OpenTransientFile(path, O_RDONLY);
+	if (fd < 0)
+	{
+		elog(LOG, "could not open %s: %m", path);
+		return;
+	}
+	if (syncfs(fd) < 0)
+		elog(LOG, "could not sync filesystem for \"%s\": %m", path);
+	CloseTransientFile(fd);
+}
+#endif
 
 /*
- * Issue fsync recursively on PGDATA and all its contents.
+ * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
+ * all potential filesystem, depending on recovery_init_sync_method setting.
  *
  * We fsync regular files and directories wherever they are, but we
  * follow symlinks only for pg_wal and immediately under pg_tblspc.
@@ -3319,6 +3342,42 @@ SyncDataDirectory(void)
 		xlog_is_symlink = true;
 #endif
 
+#ifdef HAVE_SYNCFS
+	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_SYNCFS)
+	{
+		DIR		   *dir;
+		struct dirent *de;
+
+		/*
+		 * On Linux, we don't have to open every single file one by one.  We
+		 * can use syncfs() to sync whole filesystems.  We only expect
+		 * filesystem boundaries to exist where we tolerate symlinks, namely
+		 * pg_wal and the tablespaces, so we call syncfs() for each of those
+		 * directories.
+		 */
+
+		/* Sync the top level pgdata directory. */
+		do_syncfs(".");
+		/* If any tablespaces are configured, sync each of those. */
+		dir = AllocateDir("pg_tblspc");
+		while ((de = ReadDirExtended(dir, "pg_tblspc", LOG)))
+		{
+			char		path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+				continue;
+
+			snprintf(path, MAXPGPATH, "pg_tblspc/%s", de->d_name);
+			do_syncfs(path);
+		}
+		FreeDir(dir);
+		/* If pg_wal is a symlink, process that too. */
+		if (xlog_is_symlink)
+			do_syncfs("pg_wal");
+		return;
+	}
+#endif		/* !HAVE_SYNCFS */
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.  Errors in this step are even less
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5a3ca5b765..f7aafdf772 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -487,6 +487,14 @@ const struct config_enum_entry ssl_protocol_versions_info[] = {
 StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2),
 				 "array length mismatch");
 
+static struct config_enum_entry recovery_init_sync_method_options[] = {
+	{"fsync", RECOVERY_INIT_SYNC_METHOD_FSYNC, false},
+#ifdef HAVE_SYNCFS
+	{"syncfs", RECOVERY_INIT_SYNC_METHOD_SYNCFS, false},
+#endif
+	{NULL, 0, false}
+};
+
 static struct config_enum_entry shared_memory_options[] = {
 #ifndef WIN32
 	{"sysv", SHMEM_TYPE_SYSV, false},
@@ -4859,6 +4867,15 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+		},
+		&recovery_init_sync_method,
+		RECOVERY_INIT_SYNC_METHOD_FSYNC, recovery_init_sync_method_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3ff507d5f6..76b9e815a2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -760,6 +760,7 @@
 #restart_after_crash = on		# reinitialize after backend crash?
 #remove_temp_files_after_crash = on	# remove temporary files after
 					# backend crash?
+#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+)
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 7a7cc21d8d..7bb5743e3d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -590,6 +590,9 @@
 /* Define to 1 if you have the `symlink' function. */
 #undef HAVE_SYMLINK
 
+/* Define to 1 if you have the `syncfs' function. */
+#undef HAVE_SYNCFS
+
 /* Define to 1 if you have the `sync_file_range' function. */
 #undef HAVE_SYNC_FILE_RANGE
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 30bf7d2193..328473bdc9 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -45,6 +45,11 @@
 
 #include <dirent.h>
 
+typedef enum RecoveryInitSyncMethod {
+	RECOVERY_INIT_SYNC_METHOD_FSYNC,
+	RECOVERY_INIT_SYNC_METHOD_SYNCFS
+} RecoveryInitSyncMethod;
+
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 typedef int File;
@@ -53,6 +58,7 @@ typedef int File;
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
+extern int recovery_init_sync_method;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index a4f5cc4bdb..43d171e544 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -388,6 +388,7 @@ sub GenerateFiles
 		HAVE_STRUCT_TM_TM_ZONE                   => undef,
 		HAVE_SYNC_FILE_RANGE                     => undef,
 		HAVE_SYMLINK                             => 1,
+		HAVE_SYNCFS                              => undef,
 		HAVE_SYSLOG                              => undef,
 		HAVE_SYS_EPOLL_H                         => undef,
 		HAVE_SYS_EVENT_H                         => undef,
-- 
2.30.1

v4-0002-Provide-recovery_init_sync_method-none.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Provide-recovery_init_sync_method-none.patchDownload

From 9743a09cb3c0817dc0aeedb0df9d8aafb1454f98 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 19 Mar 2021 13:06:35 +1300
Subject: [PATCH v4 2/3] Provide recovery_init_sync_method=none.

Although it's rife with hazards and not a good option to use in general,
some expert users may want to be able to disable the initial sync
performed before crash recovery.  Explicitly discouraged in the
documentation.

Discussion https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
---
 doc/src/sgml/config.sgml                      | 8 ++++++++
 src/backend/storage/file/fd.c                 | 4 ++++
 src/backend/utils/misc/guc.c                  | 1 +
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 src/include/storage/fd.h                      | 3 ++-
 5 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 32de8235bf..e547fcebd8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9748,6 +9748,14 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
         <productname>PostgreSQL</productname>, and relevant error messages may
         appear only in kernel logs.
        </para>
+       <para>
+        The option <literal>none</literal> can be specified to skip the step.
+        This is only safe if all buffered data is known to have been flushed
+        to disk already, for example by a tool such as
+        <application>pg_basebackup</application>.  It is not a good idea to
+        leave this option configured permanently in general, as it will
+        affect future crash recovery cycles.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14a07f09a4..63ef7b6a92 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3319,6 +3319,10 @@ SyncDataDirectory(void)
 	if (!enableFsync)
 		return;
 
+	/* Likewise if this specific sync step is disabled. */
+	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_NONE)
+		return;
+
 	/*
 	 * If pg_wal is a symlink, we'll need to recurse into it separately,
 	 * because the first walkdir below will ignore it.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f7aafdf772..f3a8e8e92e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -492,6 +492,7 @@ static struct config_enum_entry recovery_init_sync_method_options[] = {
 #ifdef HAVE_SYNCFS
 	{"syncfs", RECOVERY_INIT_SYNC_METHOD_SYNCFS, false},
 #endif
+	{"none", RECOVERY_INIT_SYNC_METHOD_NONE, false},
 	{NULL, 0, false}
 };
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 76b9e815a2..8b1610f6ec 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -760,7 +760,7 @@
 #restart_after_crash = on		# reinitialize after backend crash?
 #remove_temp_files_after_crash = on	# remove temporary files after
 					# backend crash?
-#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+)
+#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+), none
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 328473bdc9..dca44c38c4 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -47,7 +47,8 @@
 
 typedef enum RecoveryInitSyncMethod {
 	RECOVERY_INIT_SYNC_METHOD_FSYNC,
-	RECOVERY_INIT_SYNC_METHOD_SYNCFS
+	RECOVERY_INIT_SYNC_METHOD_SYNCFS,
+	RECOVERY_INIT_SYNC_METHOD_NONE
 } RecoveryInitSyncMethod;
 
 struct iovec;					/* avoid including port/pg_iovec.h here */
-- 
2.30.1

v4-0003-Provide-recovery_init_sync_method-wal.patchtext/x-patch; charset=US-ASCII; name=v4-0003-Provide-recovery_init_sync_method-wal.patchDownload

From f0424c1247cdf7340a01829ec4a616a5fca154c0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 26 Sep 2020 23:07:48 +1200
Subject: [PATCH v4 3/3] Provide recovery_init_sync_method=wal.

It's important to avoid relying on dirty data in the kernel's cache when
deciding that a change has already been applied.

The existing options have disadvantages:

 * recovery_init_sync_method=fsync has to open every file, which
   can be very slow for millions of files on high latency storage.
 * recovery_init_sync_method=syncfs is Linux-only, doesn't report
   errors on older versions and also syncs data outside pgdata
 * recovery_init_sync_method=none requires human action to make
   sure data is synced, and even when that's done correctly, it applies
   to all future crash recovery cycles until the setting is changed

This new option syncs only the WAL files, and then relies on analysis of
the WAL during replay.  We start by assuming that modifications from
before the last checkpoint were already flushed by the checkpoint.  That
is true if you crashed or ran pg_basebackup to the present location, but
false if you made a second copy with non-syncing tools.  Then, if
full_page_writes=on, we expect all data modified since the last
checkpoint to be written to disk as part of the end-of-recovery
checkpoint.  If full_page_writes=off, things are slightly trickier: we
skip updating pages that have an LSN higher than a WAL record
("BLK_DONE"), but we don't trust any data we read from the kernel's
cache to be really on disk, so we skip replay but we still make a note
to sync the file at the next checkpoint anyway.

(A slightly more paranoid mode would instead mark skipped pages dirty
instead, to trigger a rewrite of all pages we skip, if our thread model
includes operating systems that might mark their own page cache pages
"clean" rather than invalidating them after a writeback failure.  That
is not done in this commit.)

Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 doc/src/sgml/config.sgml               |  8 +++-
 src/backend/access/transam/xlog.c      | 51 ++++++++++++++++++--------
 src/backend/access/transam/xlogutils.c | 11 ++++++
 src/backend/storage/buffer/bufmgr.c    | 20 ++++++++++
 src/backend/storage/file/fd.c          | 12 ++++--
 src/backend/storage/smgr/md.c          | 15 ++++++++
 src/backend/storage/smgr/smgr.c        | 15 ++++++++
 src/backend/utils/misc/guc.c           |  3 +-
 src/include/storage/bufmgr.h           |  1 +
 src/include/storage/fd.h               |  1 +
 src/include/storage/md.h               |  1 +
 src/include/storage/smgr.h             |  2 +
 12 files changed, 118 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e547fcebd8..7d4d59a02f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9729,7 +9729,13 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </term>
       <listitem>
        <para>
-        When set to <literal>fsync</literal>, which is the default,
+        When set to <literal>wal</literal>, which is the default,
+        <productname>PostgreSQL</productname> will fsync all WAL files before
+        recovery begins, and rely on WAL replay to make sure all modified
+        data files are handle correctly.
+       </para>
+       <para>
+        When set to <literal>fsync</literal>,
         <productname>PostgreSQL</productname> will recursively open and fsync
         all files in the data directory before crash recovery begins.  This
         is intended to make sure that all WAL and data files are durably stored
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e149..d997e400fc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -927,7 +927,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
-static void RemoveTempXlogFiles(void);
+static void ScanXlogFilesAfterCrash(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
 static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 						   XLogSegNo *endlogSegNo);
@@ -4030,13 +4030,15 @@ UpdateLastRemovedPtr(char *filename)
 }
 
 /*
- * Remove all temporary log files in pg_wal
+ * Remove all temporary log files in pg_wal, and make sure that all remaining
+ * files are down on disk before we replay anything in them if
+ * recovery_init_sync_method requires that.
  *
  * This is called at the beginning of recovery after a previous crash,
  * at a point where no other processes write fresh WAL data.
  */
 static void
-RemoveTempXlogFiles(void)
+ScanXlogFilesAfterCrash(void)
 {
 	DIR		   *xldir;
 	struct dirent *xlde;
@@ -4048,12 +4050,23 @@ RemoveTempXlogFiles(void)
 	{
 		char		path[MAXPGPATH];
 
-		if (strncmp(xlde->d_name, "xlogtemp.", 9) != 0)
-			continue;
-
 		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
-		unlink(path);
-		elog(DEBUG2, "removed temporary WAL segment \"%s\"", path);
+
+		if (strncmp(xlde->d_name, "xlogtemp.", 9) == 0)
+		{
+			unlink(path);
+			elog(DEBUG2, "removed temporary WAL segment \"%s\"", path);
+		}
+		else if (IsXLogFileName(xlde->d_name) &&
+				 recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
+		{
+			/*
+			 * Make sure that whatever we read from this WAL file is durably on
+			 * disk before replaying in RECOVERY_INIT_SYNC_METHOD_WAL mode.
+			 * Necessary because SyncDataDirectory() won't do that for us.
+			 */
+			fsync_fname(path, false);
+		}
 	}
 	FreeDir(xldir);
 }
@@ -6547,23 +6560,29 @@ StartupXLOG(void)
 	ValidateXLOGDirectoryStructure();
 
 	/*----------
-	 * If we previously crashed, perform a couple of actions:
+	 * If we previously crashed:
 	 *
 	 * - The pg_wal directory may still include some temporary WAL segments
 	 *   used when creating a new segment, so perform some clean up to not
-	 *   bloat this path.  This is done first as there is no point to sync
-	 *   this temporary data.
+	 *   bloat this path.
+	 *
+	 * - It's possible that some WAL data had been written but not yet synced.
+	 *   Therefore we have to sync these files before replaying the records
+	 *   they contain.  How we do that depends on on the
+	 *    recovery_init_wal_sync_method setting.
 	 *
 	 * - There might be data which we had written, intending to fsync it, but
-	 *   which we had not actually fsync'd yet.  Therefore, a power failure in
-	 *   the near future might cause earlier unflushed writes to be lost, even
-	 *   though more recent data written to disk from here on would be
-	 *   persisted.  To avoid that, fsync the entire data directory.
+	 *   which we had not actually fsync'd yet.  If
+	 *   recovery_init_wal_sync_method=wal, then XLogReadBufferForRedo() will
+	 *   compute the set of files that may need fsyncing at the next
+	 *   checkpoint, during recovery.  Otherwise, we'll sync the entire data
+	 *   directory to avoid relying on data from the kernel's cache that hasn't
+	 *   reached disk yet.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
 		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
 	{
-		RemoveTempXlogFiles();
+		ScanXlogFilesAfterCrash();
 		SyncDataDirectory();
 	}
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f46..593c50ee3f 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -401,7 +401,18 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 					LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
 			}
 			if (lsn <= PageGetLSN(BufferGetPage(*buf)))
+			{
+				/*
+				 * The page is from the future.  We must be in crash recovery.
+				 * We don't need to redo the record, but we do need to make
+				 * sure that the image as we have seen it is durably stored on
+				 * disk as part of the next checkpoint, unless that was already
+				 * done by SyncDataDirectory().
+				 */
+				if (recovery_init_sync_method != RECOVERY_INIT_SYNC_METHOD_WAL)
+					RequestBufferSync(*buf);
 				return BLK_DONE;
+			}
 			else
 				return BLK_NEEDS_REDO;
 		}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c9..6393545f0c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1530,6 +1530,26 @@ MarkBufferDirty(Buffer buffer)
 	}
 }
 
+/*
+ * Request that the file containing a buffer be fsynced at the next checkpoint.
+ * This is only used in crash recovery, to make it safe to skip applying WAL on
+ * the basis that the page already has a change.  We want to make sure that the
+ * data we read from the kernel matches what's on disk at the next checkpoint.
+ */
+void
+RequestBufferSync(Buffer buffer)
+{
+	SMgrRelation reln;
+	BufferDesc *bufHdr;
+
+	Assert(BufferIsPinned(buffer));
+	Assert(!BufferIsLocal(buffer));
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+	reln = smgropen(bufHdr->tag.rnode, InvalidBackendId);
+	smgrsync(reln, bufHdr->tag.forkNum, bufHdr->tag.blockNum);
+}
+
 /*
  * ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer()
  *
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 63ef7b6a92..e20ba9c00b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -336,6 +336,7 @@ static void walkdir(const char *path,
 					void (*action) (const char *fname, bool isdir, int elevel),
 					bool process_symlinks,
 					int elevel);
+
 #ifdef PG_FLUSH_DATA_WORKS
 static void pre_sync_fname(const char *fname, bool isdir, int elevel);
 #endif
@@ -3290,7 +3291,7 @@ do_syncfs(const char *path)
 
 /*
  * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
- * all potential filesystem, depending on recovery_init_sync_method setting.
+ * all potential filesystem, depending on the recovery_init_sync_method setting.
  *
  * We fsync regular files and directories wherever they are, but we
  * follow symlinks only for pg_wal and immediately under pg_tblspc.
@@ -3319,8 +3320,12 @@ SyncDataDirectory(void)
 	if (!enableFsync)
 		return;
 
-	/* Likewise if this specific sync step is disabled. */
-	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_NONE)
+	/*
+	 * Likewise if this specific sync step is disabled, or if we're using WAL
+	 * analysis instead.
+	 */
+	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_NONE ||
+		recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
 		return;
 
 	/*
@@ -3478,7 +3483,6 @@ walkdir(const char *path,
 		(*action) (path, true, elevel);
 }
 
-
 /*
  * Hint to the OS that it should get ready to fsync() this file.
  *
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1e12cfad8e..d74506be31 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -961,6 +961,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	}
 }
 
+/*
+ *	mdsync() -- Schedule a sync at the next checkpoint.
+ *
+ * This is useful in crash recovery, to ensure that data we've read from the
+ * kernel matches what is on disk before a checkpoint.
+ */
+void
+mdsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	MdfdVec	   *seg;
+
+	seg = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+	register_dirty_segment(reln, forknum, seg);
+}
+
 /*
  * register_dirty_segment() -- Mark a relation segment as needing fsync
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..6856c40300 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,8 @@ typedef struct f_smgr
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_sync) (SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber blocknum);
 } f_smgr;
 
 static const f_smgr smgrsw[] = {
@@ -79,6 +81,7 @@ static const f_smgr smgrsw[] = {
 		.smgr_read = mdread,
 		.smgr_write = mdwrite,
 		.smgr_writeback = mdwriteback,
+		.smgr_sync = mdsync,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
 		.smgr_immedsync = mdimmedsync,
@@ -540,6 +543,18 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 											nblocks);
 }
 
+
+/*
+ *	smgrsync() -- Request that the file containing a block be flushed at the
+ *				  next checkpoint.
+ */
+void
+smgrsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	smgrsw[reln->smgr_which].smgr_sync(reln, forknum, blocknum);
+}
+
+
 /*
  *	smgrnblocks() -- Calculate the number of blocks in the
  *					 supplied relation.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f3a8e8e92e..6e73e7b38c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -488,6 +488,7 @@ StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2)
 				 "array length mismatch");
 
 static struct config_enum_entry recovery_init_sync_method_options[] = {
+	{"wal", RECOVERY_INIT_SYNC_METHOD_WAL, false},
 	{"fsync", RECOVERY_INIT_SYNC_METHOD_FSYNC, false},
 #ifdef HAVE_SYNCFS
 	{"syncfs", RECOVERY_INIT_SYNC_METHOD_SYNCFS, false},
@@ -4873,7 +4874,7 @@ static struct config_enum ConfigureNamesEnum[] =
 			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
 		},
 		&recovery_init_sync_method,
-		RECOVERY_INIT_SYNC_METHOD_FSYNC, recovery_init_sync_method_options,
+		RECOVERY_INIT_SYNC_METHOD_WAL, recovery_init_sync_method_options,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fb00fda6a7..f5409264da 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -186,6 +186,7 @@ extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
+extern void RequestBufferSync(Buffer buffer);
 extern void IncrBufferRefCount(Buffer buffer);
 extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
 								   BlockNumber blockNum);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index dca44c38c4..801b2a4601 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -46,6 +46,7 @@
 #include <dirent.h>
 
 typedef enum RecoveryInitSyncMethod {
+	RECOVERY_INIT_SYNC_METHOD_WAL,
 	RECOVERY_INIT_SYNC_METHOD_FSYNC,
 	RECOVERY_INIT_SYNC_METHOD_SYNCFS,
 	RECOVERY_INIT_SYNC_METHOD_NONE
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..35e813410a 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -40,6 +40,7 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void mdsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..4a9cc9f181 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -103,6 +103,8 @@ extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 						 int nforks, BlockNumber *nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrsync(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum);
 extern void AtEOXact_SMgr(void);
 
 #endif							/* SMGR_H */
-- 
2.30.1

#30

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Thomas Munro (#29)

3 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:

PS: For illustration/discussion, I've also attached a "none" patch. I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.

Erm... I attached the wrong version by mistake. Here's a better one.
(Note: I'm not expecting any review of the 0003 patch in this CF, I
just wanted to share the future direction I'd like to consider for
this problem.)

Attachments:

v5-0001-Provide-recovery_init_sync_method-syncfs.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Provide-recovery_init_sync_method-syncfs.patchDownload

From a5faf47f720684bd7952ecf0ca0c495d9ec14989 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 27 Sep 2020 17:22:10 +1300
Subject: [PATCH v5 1/3] Provide recovery_init_sync_method=syncfs.

On some systems, opening every file in the data directory can be very
slow, as the first step before crash recovery can begin.  Provide an
alternative method, for Linux only, where we call syncfs() on every
possibly different filesystem under the data directory.  This avoids
faulting in inodes for potentially very many inodes.  It comes with some
caveats, so it's not the default.

Reported-by: Michael Brown <michael.brown@discourse.org>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Paul Guo <guopa@vmware.com>
Reviewed-by: Bruce Momjian <bruce@momjian.us>
Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 configure                                     |  2 +-
 configure.ac                                  |  1 +
 doc/src/sgml/config.sgml                      | 30 +++++++++
 src/backend/storage/file/fd.c                 | 61 ++++++++++++++++++-
 src/backend/utils/misc/guc.c                  | 17 ++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/pg_config.h.in                    |  3 +
 src/include/storage/fd.h                      |  6 ++
 src/tools/msvc/Solution.pm                    |  1 +
 9 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 3fd4cecbeb..6a2051da9d 100755
--- a/configure
+++ b/configure
@@ -15239,7 +15239,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2f1585adc0..dd819ab2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strchrnul
 	strsignal
 	symlink
+	syncfs
 	sync_file_range
 	uselocale
 	wcstombs_l
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 863ac31c6b..32de8235bf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9721,6 +9721,36 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-recovery-init-sync-method" xreflabel="recovery_init_sync_method">
+      <term><varname>recovery_init_sync_method</varname> (<type>enum</type>)
+       <indexterm>
+        <primary><varname>recovery_init_sync_method</varname> configuration parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.  This
+        is intended to make sure that all WAL and data files are durably stored
+        on disk before replaying changes.  This applies whenever starting a
+        database cluster that did not shut down cleanly, including copies
+        created with <application>pg_basebackup</application>.
+       </para>
+       <para>
+        On Linux, <literal>syncfs</literal> may be used instead, to ask the
+        operating system to flush the whole file system.  This may be a lot
+        faster, because it doesn't need to open each file one by one.  On the
+        other hand, it may be slower if the file system is shared by other
+        applications that modify a lot of files, since those files will also
+        be flushed to disk.  Furthermore, on versions of Linux before 5.8, I/O
+        errors encountered while writing data to disk may not be reported to
+        <productname>PostgreSQL</productname>, and relevant error messages may
+        appear only in kernel logs.
+       </para>
+      </listitem>
+     </varlistentry>
+
     </variablelist>
 
    </sect1>
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 110ba31517..14a07f09a4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -72,9 +72,11 @@
 
 #include "postgres.h"
 
+#include <dirent.h>
 #include <sys/file.h>
 #include <sys/param.h>
 #include <sys/stat.h>
+#include <sys/types.h>
 #ifndef WIN32
 #include <sys/mman.h>
 #endif
@@ -158,6 +160,9 @@ int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
+/* How SyncDataDirectory() should do its job. */
+int			recovery_init_sync_method = RECOVERY_INIT_SYNC_METHOD_FSYNC;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
@@ -3265,9 +3270,27 @@ looks_like_temp_rel_name(const char *name)
 	return true;
 }
 
+#ifdef HAVE_SYNCFS
+static void
+do_syncfs(const char *path)
+{
+	int		fd;
+
+	fd = OpenTransientFile(path, O_RDONLY);
+	if (fd < 0)
+	{
+		elog(LOG, "could not open %s: %m", path);
+		return;
+	}
+	if (syncfs(fd) < 0)
+		elog(LOG, "could not sync filesystem for \"%s\": %m", path);
+	CloseTransientFile(fd);
+}
+#endif
 
 /*
- * Issue fsync recursively on PGDATA and all its contents.
+ * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
+ * all potential filesystem, depending on recovery_init_sync_method setting.
  *
  * We fsync regular files and directories wherever they are, but we
  * follow symlinks only for pg_wal and immediately under pg_tblspc.
@@ -3319,6 +3342,42 @@ SyncDataDirectory(void)
 		xlog_is_symlink = true;
 #endif
 
+#ifdef HAVE_SYNCFS
+	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_SYNCFS)
+	{
+		DIR		   *dir;
+		struct dirent *de;
+
+		/*
+		 * On Linux, we don't have to open every single file one by one.  We
+		 * can use syncfs() to sync whole filesystems.  We only expect
+		 * filesystem boundaries to exist where we tolerate symlinks, namely
+		 * pg_wal and the tablespaces, so we call syncfs() for each of those
+		 * directories.
+		 */
+
+		/* Sync the top level pgdata directory. */
+		do_syncfs(".");
+		/* If any tablespaces are configured, sync each of those. */
+		dir = AllocateDir("pg_tblspc");
+		while ((de = ReadDirExtended(dir, "pg_tblspc", LOG)))
+		{
+			char		path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+				continue;
+
+			snprintf(path, MAXPGPATH, "pg_tblspc/%s", de->d_name);
+			do_syncfs(path);
+		}
+		FreeDir(dir);
+		/* If pg_wal is a symlink, process that too. */
+		if (xlog_is_symlink)
+			do_syncfs("pg_wal");
+		return;
+	}
+#endif		/* !HAVE_SYNCFS */
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.  Errors in this step are even less
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5a3ca5b765..f7aafdf772 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -487,6 +487,14 @@ const struct config_enum_entry ssl_protocol_versions_info[] = {
 StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2),
 				 "array length mismatch");
 
+static struct config_enum_entry recovery_init_sync_method_options[] = {
+	{"fsync", RECOVERY_INIT_SYNC_METHOD_FSYNC, false},
+#ifdef HAVE_SYNCFS
+	{"syncfs", RECOVERY_INIT_SYNC_METHOD_SYNCFS, false},
+#endif
+	{NULL, 0, false}
+};
+
 static struct config_enum_entry shared_memory_options[] = {
 #ifndef WIN32
 	{"sysv", SHMEM_TYPE_SYSV, false},
@@ -4859,6 +4867,15 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+		},
+		&recovery_init_sync_method,
+		RECOVERY_INIT_SYNC_METHOD_FSYNC, recovery_init_sync_method_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3ff507d5f6..76b9e815a2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -760,6 +760,7 @@
 #restart_after_crash = on		# reinitialize after backend crash?
 #remove_temp_files_after_crash = on	# remove temporary files after
 					# backend crash?
+#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+)
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 7a7cc21d8d..7bb5743e3d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -590,6 +590,9 @@
 /* Define to 1 if you have the `symlink' function. */
 #undef HAVE_SYMLINK
 
+/* Define to 1 if you have the `syncfs' function. */
+#undef HAVE_SYNCFS
+
 /* Define to 1 if you have the `sync_file_range' function. */
 #undef HAVE_SYNC_FILE_RANGE
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 30bf7d2193..328473bdc9 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -45,6 +45,11 @@
 
 #include <dirent.h>
 
+typedef enum RecoveryInitSyncMethod {
+	RECOVERY_INIT_SYNC_METHOD_FSYNC,
+	RECOVERY_INIT_SYNC_METHOD_SYNCFS
+} RecoveryInitSyncMethod;
+
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 typedef int File;
@@ -53,6 +58,7 @@ typedef int File;
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
+extern int recovery_init_sync_method;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index a4f5cc4bdb..43d171e544 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -388,6 +388,7 @@ sub GenerateFiles
 		HAVE_STRUCT_TM_TM_ZONE                   => undef,
 		HAVE_SYNC_FILE_RANGE                     => undef,
 		HAVE_SYMLINK                             => 1,
+		HAVE_SYNCFS                              => undef,
 		HAVE_SYSLOG                              => undef,
 		HAVE_SYS_EPOLL_H                         => undef,
 		HAVE_SYS_EVENT_H                         => undef,
-- 
2.30.1

v5-0002-Provide-recovery_init_sync_method-none.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Provide-recovery_init_sync_method-none.patchDownload

From 9743a09cb3c0817dc0aeedb0df9d8aafb1454f98 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 19 Mar 2021 13:06:35 +1300
Subject: [PATCH v5 2/3] Provide recovery_init_sync_method=none.

Although it's rife with hazards and not a good option to use in general,
some expert users may want to be able to disable the initial sync
performed before crash recovery.  Explicitly discouraged in the
documentation.

Discussion https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
---
 doc/src/sgml/config.sgml                      | 8 ++++++++
 src/backend/storage/file/fd.c                 | 4 ++++
 src/backend/utils/misc/guc.c                  | 1 +
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 src/include/storage/fd.h                      | 3 ++-
 5 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 32de8235bf..e547fcebd8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9748,6 +9748,14 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
         <productname>PostgreSQL</productname>, and relevant error messages may
         appear only in kernel logs.
        </para>
+       <para>
+        The option <literal>none</literal> can be specified to skip the step.
+        This is only safe if all buffered data is known to have been flushed
+        to disk already, for example by a tool such as
+        <application>pg_basebackup</application>.  It is not a good idea to
+        leave this option configured permanently in general, as it will
+        affect future crash recovery cycles.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14a07f09a4..63ef7b6a92 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3319,6 +3319,10 @@ SyncDataDirectory(void)
 	if (!enableFsync)
 		return;
 
+	/* Likewise if this specific sync step is disabled. */
+	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_NONE)
+		return;
+
 	/*
 	 * If pg_wal is a symlink, we'll need to recurse into it separately,
 	 * because the first walkdir below will ignore it.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f7aafdf772..f3a8e8e92e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -492,6 +492,7 @@ static struct config_enum_entry recovery_init_sync_method_options[] = {
 #ifdef HAVE_SYNCFS
 	{"syncfs", RECOVERY_INIT_SYNC_METHOD_SYNCFS, false},
 #endif
+	{"none", RECOVERY_INIT_SYNC_METHOD_NONE, false},
 	{NULL, 0, false}
 };
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 76b9e815a2..8b1610f6ec 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -760,7 +760,7 @@
 #restart_after_crash = on		# reinitialize after backend crash?
 #remove_temp_files_after_crash = on	# remove temporary files after
 					# backend crash?
-#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+)
+#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+), none
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 328473bdc9..dca44c38c4 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -47,7 +47,8 @@
 
 typedef enum RecoveryInitSyncMethod {
 	RECOVERY_INIT_SYNC_METHOD_FSYNC,
-	RECOVERY_INIT_SYNC_METHOD_SYNCFS
+	RECOVERY_INIT_SYNC_METHOD_SYNCFS,
+	RECOVERY_INIT_SYNC_METHOD_NONE
 } RecoveryInitSyncMethod;
 
 struct iovec;					/* avoid including port/pg_iovec.h here */
-- 
2.30.1

v5-0003-Provide-recovery_init_sync_method-wal.patchtext/x-patch; charset=US-ASCII; name=v5-0003-Provide-recovery_init_sync_method-wal.patchDownload

From 04970c321e9969862f3a3f2107657dc3aa6469d9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 26 Sep 2020 23:07:48 +1200
Subject: [PATCH v5 3/3] Provide recovery_init_sync_method=wal.

It's important to avoid relying on dirty data in the kernel's cache when
deciding that a change has already been applied.

The existing options have disadvantages:

 * recovery_init_sync_method=fsync has to open every file, which
   can be very slow for millions of files on high latency storage.
 * recovery_init_sync_method=syncfs is Linux-only, doesn't report
   errors on older versions and also syncs data outside pgdata
 * recovery_init_sync_method=none requires human action to make
   sure data is synced, and even when that's done correctly, it applies
   to all future crash recovery cycles until the setting is changed

This new option syncs only the WAL files, and then relies on analysis of
the WAL during replay.  We start by assuming that modifications from
before the last checkpoint were already flushed by the checkpoint.  That
is true if you crashed or ran pg_basebackup to the present location, but
false if you made a second copy with non-syncing tools.  Then, if
full_page_writes=on, we expect all data modified since the last
checkpoint to be written to disk as part of the end-of-recovery
checkpoint.  If full_page_writes=off, things are slightly trickier: we
skip updating pages that have an LSN higher than a WAL record
("BLK_DONE"), but we don't trust any data we read from the kernel's
cache to be really on disk, so we skip replay but we still make a note
to sync the file at the next checkpoint anyway.

(A slightly more paranoid mode would mark skipped pages dirty instead,
to trigger a rewrite of all pages we skip, if our threat model includes
operating systems that might mark their own page cache pages "clean"
rather than invalidating them after a writeback failure.  That is not
done in this commit.)

Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  8 ++-
 src/backend/access/transam/xlog.c             | 52 +++++++++++++------
 src/backend/access/transam/xlogutils.c        | 11 ++++
 src/backend/storage/buffer/bufmgr.c           | 20 +++++++
 src/backend/storage/file/fd.c                 | 12 +++--
 src/backend/storage/smgr/md.c                 | 15 ++++++
 src/backend/storage/smgr/smgr.c               | 15 ++++++
 src/backend/utils/misc/guc.c                  |  3 +-
 src/backend/utils/misc/postgresql.conf.sample |  2 +-
 src/include/storage/bufmgr.h                  |  1 +
 src/include/storage/fd.h                      |  1 +
 src/include/storage/md.h                      |  1 +
 src/include/storage/smgr.h                    |  2 +
 13 files changed, 120 insertions(+), 23 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e547fcebd8..7d4d59a02f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9729,7 +9729,13 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </term>
       <listitem>
        <para>
-        When set to <literal>fsync</literal>, which is the default,
+        When set to <literal>wal</literal>, which is the default,
+        <productname>PostgreSQL</productname> will fsync all WAL files before
+        recovery begins, and rely on WAL replay to make sure all modified
+        data files are handle correctly.
+       </para>
+       <para>
+        When set to <literal>fsync</literal>,
         <productname>PostgreSQL</productname> will recursively open and fsync
         all files in the data directory before crash recovery begins.  This
         is intended to make sure that all WAL and data files are durably stored
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e149..89678e6470 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -927,7 +927,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
-static void RemoveTempXlogFiles(void);
+static void ScanXlogFilesAfterCrash(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
 static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 						   XLogSegNo *endlogSegNo);
@@ -4030,13 +4030,15 @@ UpdateLastRemovedPtr(char *filename)
 }
 
 /*
- * Remove all temporary log files in pg_wal
+ * Remove all temporary log files in pg_wal, and make sure that all remaining
+ * files are down on disk before we replay anything in them if
+ * recovery_init_sync_method requires that.
  *
  * This is called at the beginning of recovery after a previous crash,
  * at a point where no other processes write fresh WAL data.
  */
 static void
-RemoveTempXlogFiles(void)
+ScanXlogFilesAfterCrash(void)
 {
 	DIR		   *xldir;
 	struct dirent *xlde;
@@ -4048,12 +4050,23 @@ RemoveTempXlogFiles(void)
 	{
 		char		path[MAXPGPATH];
 
-		if (strncmp(xlde->d_name, "xlogtemp.", 9) != 0)
-			continue;
-
 		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
-		unlink(path);
-		elog(DEBUG2, "removed temporary WAL segment \"%s\"", path);
+
+		if (strncmp(xlde->d_name, "xlogtemp.", 9) == 0)
+		{
+			unlink(path);
+			elog(DEBUG2, "removed temporary WAL segment \"%s\"", path);
+		}
+		else if (IsXLogFileName(xlde->d_name) &&
+				 recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
+		{
+			/*
+			 * Make sure that whatever we read from this WAL file is durably on
+			 * disk before replaying in RECOVERY_INIT_SYNC_METHOD_WAL mode.
+			 * Necessary because SyncDataDirectory() won't do that for us.
+			 */
+			fsync_fname(path, false);
+		}
 	}
 	FreeDir(xldir);
 }
@@ -6547,23 +6560,30 @@ StartupXLOG(void)
 	ValidateXLOGDirectoryStructure();
 
 	/*----------
-	 * If we previously crashed, perform a couple of actions:
+	 * If we previously crashed:
 	 *
 	 * - The pg_wal directory may still include some temporary WAL segments
 	 *   used when creating a new segment, so perform some clean up to not
-	 *   bloat this path.  This is done first as there is no point to sync
-	 *   this temporary data.
+	 *   bloat this path, in ScanXlogFilesAfterCrash().
+	 *
+	 * - It's possible that some WAL data had been written but not yet synced.
+	 *   Therefore we have to sync these files before replaying the records
+	 *   they contain.  If recovery_init_wal_sync_method=wal we'll do that
+	 *   in ScanXlogFilesAfterCrash(); otherwise we'll leave it up to
+	 *   SyncDataDirectory().
 	 *
 	 * - There might be data which we had written, intending to fsync it, but
-	 *   which we had not actually fsync'd yet.  Therefore, a power failure in
-	 *   the near future might cause earlier unflushed writes to be lost, even
-	 *   though more recent data written to disk from here on would be
-	 *   persisted.  To avoid that, fsync the entire data directory.
+	 *   which we had not actually fsync'd yet.  If
+	 *   recovery_init_wal_sync_method=wal, then XLogReadBufferForRedo() will
+	 *   compute the set of files that may need fsyncing at the next
+	 *   checkpoint, during recovery.  Otherwise, SyncDataDirectory() will
+	 *   sync the entire pgdata directory up front, to avoid relying on data
+	 *   from the kernel's cache that hasn't reached disk yet.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
 		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
 	{
-		RemoveTempXlogFiles();
+		ScanXlogFilesAfterCrash();
 		SyncDataDirectory();
 	}
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f46..85303e761e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -401,7 +401,18 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 					LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
 			}
 			if (lsn <= PageGetLSN(BufferGetPage(*buf)))
+			{
+				/*
+				 * The page is from the future.  We must be in crash recovery.
+				 * We don't need to redo the record, but we do need to make
+				 * sure that the image as we have seen it is durably stored on
+				 * disk as part of the next checkpoint, unless that was already
+				 * done by SyncDataDirectory().
+				 */
+				if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
+					RequestBufferSync(*buf);
 				return BLK_DONE;
+			}
 			else
 				return BLK_NEEDS_REDO;
 		}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c9..6393545f0c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1530,6 +1530,26 @@ MarkBufferDirty(Buffer buffer)
 	}
 }
 
+/*
+ * Request that the file containing a buffer be fsynced at the next checkpoint.
+ * This is only used in crash recovery, to make it safe to skip applying WAL on
+ * the basis that the page already has a change.  We want to make sure that the
+ * data we read from the kernel matches what's on disk at the next checkpoint.
+ */
+void
+RequestBufferSync(Buffer buffer)
+{
+	SMgrRelation reln;
+	BufferDesc *bufHdr;
+
+	Assert(BufferIsPinned(buffer));
+	Assert(!BufferIsLocal(buffer));
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+	reln = smgropen(bufHdr->tag.rnode, InvalidBackendId);
+	smgrsync(reln, bufHdr->tag.forkNum, bufHdr->tag.blockNum);
+}
+
 /*
  * ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer()
  *
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 63ef7b6a92..e20ba9c00b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -336,6 +336,7 @@ static void walkdir(const char *path,
 					void (*action) (const char *fname, bool isdir, int elevel),
 					bool process_symlinks,
 					int elevel);
+
 #ifdef PG_FLUSH_DATA_WORKS
 static void pre_sync_fname(const char *fname, bool isdir, int elevel);
 #endif
@@ -3290,7 +3291,7 @@ do_syncfs(const char *path)
 
 /*
  * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
- * all potential filesystem, depending on recovery_init_sync_method setting.
+ * all potential filesystem, depending on the recovery_init_sync_method setting.
  *
  * We fsync regular files and directories wherever they are, but we
  * follow symlinks only for pg_wal and immediately under pg_tblspc.
@@ -3319,8 +3320,12 @@ SyncDataDirectory(void)
 	if (!enableFsync)
 		return;
 
-	/* Likewise if this specific sync step is disabled. */
-	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_NONE)
+	/*
+	 * Likewise if this specific sync step is disabled, or if we're using WAL
+	 * analysis instead.
+	 */
+	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_NONE ||
+		recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
 		return;
 
 	/*
@@ -3478,7 +3483,6 @@ walkdir(const char *path,
 		(*action) (path, true, elevel);
 }
 
-
 /*
  * Hint to the OS that it should get ready to fsync() this file.
  *
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1e12cfad8e..d74506be31 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -961,6 +961,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	}
 }
 
+/*
+ *	mdsync() -- Schedule a sync at the next checkpoint.
+ *
+ * This is useful in crash recovery, to ensure that data we've read from the
+ * kernel matches what is on disk before a checkpoint.
+ */
+void
+mdsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	MdfdVec	   *seg;
+
+	seg = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+	register_dirty_segment(reln, forknum, seg);
+}
+
 /*
  * register_dirty_segment() -- Mark a relation segment as needing fsync
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..6856c40300 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,8 @@ typedef struct f_smgr
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_sync) (SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber blocknum);
 } f_smgr;
 
 static const f_smgr smgrsw[] = {
@@ -79,6 +81,7 @@ static const f_smgr smgrsw[] = {
 		.smgr_read = mdread,
 		.smgr_write = mdwrite,
 		.smgr_writeback = mdwriteback,
+		.smgr_sync = mdsync,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
 		.smgr_immedsync = mdimmedsync,
@@ -540,6 +543,18 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 											nblocks);
 }
 
+
+/*
+ *	smgrsync() -- Request that the file containing a block be flushed at the
+ *				  next checkpoint.
+ */
+void
+smgrsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	smgrsw[reln->smgr_which].smgr_sync(reln, forknum, blocknum);
+}
+
+
 /*
  *	smgrnblocks() -- Calculate the number of blocks in the
  *					 supplied relation.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f3a8e8e92e..6e73e7b38c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -488,6 +488,7 @@ StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2)
 				 "array length mismatch");
 
 static struct config_enum_entry recovery_init_sync_method_options[] = {
+	{"wal", RECOVERY_INIT_SYNC_METHOD_WAL, false},
 	{"fsync", RECOVERY_INIT_SYNC_METHOD_FSYNC, false},
 #ifdef HAVE_SYNCFS
 	{"syncfs", RECOVERY_INIT_SYNC_METHOD_SYNCFS, false},
@@ -4873,7 +4874,7 @@ static struct config_enum ConfigureNamesEnum[] =
 			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
 		},
 		&recovery_init_sync_method,
-		RECOVERY_INIT_SYNC_METHOD_FSYNC, recovery_init_sync_method_options,
+		RECOVERY_INIT_SYNC_METHOD_WAL, recovery_init_sync_method_options,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8b1610f6ec..96107d641e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -760,7 +760,7 @@
 #restart_after_crash = on		# reinitialize after backend crash?
 #remove_temp_files_after_crash = on	# remove temporary files after
 					# backend crash?
-#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+), none
+#recovery_init_sync_method = wal	# wal, fsync, syncfs (Linux 5.8+), none
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fb00fda6a7..f5409264da 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -186,6 +186,7 @@ extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
+extern void RequestBufferSync(Buffer buffer);
 extern void IncrBufferRefCount(Buffer buffer);
 extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
 								   BlockNumber blockNum);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index dca44c38c4..801b2a4601 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -46,6 +46,7 @@
 #include <dirent.h>
 
 typedef enum RecoveryInitSyncMethod {
+	RECOVERY_INIT_SYNC_METHOD_WAL,
 	RECOVERY_INIT_SYNC_METHOD_FSYNC,
 	RECOVERY_INIT_SYNC_METHOD_SYNCFS,
 	RECOVERY_INIT_SYNC_METHOD_NONE
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..35e813410a 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -40,6 +40,7 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void mdsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..4a9cc9f181 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -103,6 +103,8 @@ extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 						 int nforks, BlockNumber *nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrsync(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum);
 extern void AtEOXact_SMgr(void);
 
 #endif							/* SMGR_H */
-- 
2.30.1

#31

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Thomas Munro (#30)

Re: fdatasync performance problem with large number of DB files

On 2021/03/19 10:37, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:

PS: For illustration/discussion, I've also attached a "none" patch. I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.

Erm... I attached the wrong version by mistake. Here's a better one.

Thanks for updating the patch! It looks good to me!
I have one minor comment for the patch.

+		elog(LOG, "could not open %s: %m", path);
+		return;
+	}
+	if (syncfs(fd) < 0)
+		elog(LOG, "could not sync filesystem for \"%s\": %m", path);

Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()? For example,

ereport(LOG,
(errcode_for_file_access(),
errmsg("could not open \"%s\": %m", path)))

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#32

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Fujii Masao (#31)

Re: fdatasync performance problem with large number of DB files

On 2021/03/19 11:22, Fujii Masao wrote:

On 2021/03/19 10:37, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 2:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:

PS: For illustration/discussion, I've also attached a "none" patch. I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.

Erm... I attached the wrong version by mistake. Here's a better one.

0002 patch looks good to me. Thanks!
I have minor comments.

- * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
- * all potential filesystem, depending on recovery_init_sync_method setting.
+ * Issue fsync recursively on PGDATA and all its contents, issue syncfs for
+ * all potential filesystem, or do nothing, depending on
+ * recovery_init_sync_method setting.

The comment in SyncDataDirectory() should be updated so that
it mentions "none" method, as the above?

+        This is only safe if all buffered data is known to have been flushed
+        to disk already, for example by a tool such as
+        <application>pg_basebackup</application>.  It is not a good idea to

Isn't it better to add something like "without <literal>--no-sync</literal>"
to "pg_basebackup" part? Which would prevent users from misunderstanding
that pg_basebackup always ensures that whatever options are specified.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#33

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: Fujii Masao (#32)

3 attachment(s)

Re: fdatasync performance problem with large number of DB files

On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Thanks for updating the patch! It looks good to me!
I have one minor comment for the patch.
+               elog(LOG, "could not open %s: %m", path);
+               return;
+       }
+       if (syncfs(fd) < 0)
+               elog(LOG, "could not sync filesystem for \"%s\": %m", path);
Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()? For example,

Fixed.

I'll let this sit until tomorrow to collect any other feedback or
objections, and then push the 0001 patch
(recovery_init_sync_method=syncfs).

On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

0002 patch looks good to me. Thanks!
I have minor comments.

Ok, I made the changes you suggested. Let's see if anyone else would
like to vote for or against the concept of the 0002 patch
(recovery_init_sync_method=none).

Attachments:

v6-0001-Provide-recovery_init_sync_method-syncfs.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Provide-recovery_init_sync_method-syncfs.patchDownload

From c19c74cb50255141db4b8f823f2867eadb1de422 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 27 Sep 2020 17:22:10 +1300
Subject: [PATCH v6 1/3] Provide recovery_init_sync_method=syncfs.

On some systems, opening every file in the data directory can be very
slow, as the first step before crash recovery can begin.  Provide an
alternative method, for Linux only, where we call syncfs() on every
possibly different filesystem under the data directory.  This avoids
faulting in potentially many inodes from potentially slow storage.  It
comes with some caveats, so it's not the default.

Reported-by: Michael Brown <michael.brown@discourse.org>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Paul Guo <guopa@vmware.com>
Reviewed-by: Bruce Momjian <bruce@momjian.us>
Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 configure                                     |  2 +-
 configure.ac                                  |  1 +
 doc/src/sgml/config.sgml                      | 30 +++++++++
 src/backend/storage/file/fd.c                 | 65 ++++++++++++++++++-
 src/backend/utils/misc/guc.c                  | 17 +++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/pg_config.h.in                    |  3 +
 src/include/storage/fd.h                      |  6 ++
 src/tools/msvc/Solution.pm                    |  1 +
 9 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 3fd4cecbeb..6a2051da9d 100755
--- a/configure
+++ b/configure
@@ -15239,7 +15239,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index 2f1585adc0..dd819ab2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	strchrnul
 	strsignal
 	symlink
+	syncfs
 	sync_file_range
 	uselocale
 	wcstombs_l
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 863ac31c6b..32de8235bf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9721,6 +9721,36 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-recovery-init-sync-method" xreflabel="recovery_init_sync_method">
+      <term><varname>recovery_init_sync_method</varname> (<type>enum</type>)
+       <indexterm>
+        <primary><varname>recovery_init_sync_method</varname> configuration parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When set to <literal>fsync</literal>, which is the default,
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.  This
+        is intended to make sure that all WAL and data files are durably stored
+        on disk before replaying changes.  This applies whenever starting a
+        database cluster that did not shut down cleanly, including copies
+        created with <application>pg_basebackup</application>.
+       </para>
+       <para>
+        On Linux, <literal>syncfs</literal> may be used instead, to ask the
+        operating system to flush the whole file system.  This may be a lot
+        faster, because it doesn't need to open each file one by one.  On the
+        other hand, it may be slower if the file system is shared by other
+        applications that modify a lot of files, since those files will also
+        be flushed to disk.  Furthermore, on versions of Linux before 5.8, I/O
+        errors encountered while writing data to disk may not be reported to
+        <productname>PostgreSQL</productname>, and relevant error messages may
+        appear only in kernel logs.
+       </para>
+      </listitem>
+     </varlistentry>
+
     </variablelist>
 
    </sect1>
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 110ba31517..28933f8bbe 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -72,9 +72,11 @@
 
 #include "postgres.h"
 
+#include <dirent.h>
 #include <sys/file.h>
 #include <sys/param.h>
 #include <sys/stat.h>
+#include <sys/types.h>
 #ifndef WIN32
 #include <sys/mman.h>
 #endif
@@ -158,6 +160,9 @@ int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
+/* How SyncDataDirectory() should do its job. */
+int			recovery_init_sync_method = RECOVERY_INIT_SYNC_METHOD_FSYNC;
+
 /* Debugging.... */
 
 #ifdef FDDEBUG
@@ -3265,9 +3270,31 @@ looks_like_temp_rel_name(const char *name)
 	return true;
 }
 
+#ifdef HAVE_SYNCFS
+static void
+do_syncfs(const char *path)
+{
+	int		fd;
+
+	fd = OpenTransientFile(path, O_RDONLY);
+	if (fd < 0)
+	{
+		ereport(LOG,
+				(errcode_for_file_access(),
+				 errmsg("could not open %s: %m", path)));
+		return;
+	}
+	if (syncfs(fd) < 0)
+		ereport(LOG,
+				(errcode_for_file_access(),
+				 errmsg("could not sync filesystem for \"%s\": %m", path)));
+	CloseTransientFile(fd);
+}
+#endif
 
 /*
- * Issue fsync recursively on PGDATA and all its contents.
+ * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
+ * all potential filesystem, depending on recovery_init_sync_method setting.
  *
  * We fsync regular files and directories wherever they are, but we
  * follow symlinks only for pg_wal and immediately under pg_tblspc.
@@ -3319,6 +3346,42 @@ SyncDataDirectory(void)
 		xlog_is_symlink = true;
 #endif
 
+#ifdef HAVE_SYNCFS
+	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_SYNCFS)
+	{
+		DIR		   *dir;
+		struct dirent *de;
+
+		/*
+		 * On Linux, we don't have to open every single file one by one.  We
+		 * can use syncfs() to sync whole filesystems.  We only expect
+		 * filesystem boundaries to exist where we tolerate symlinks, namely
+		 * pg_wal and the tablespaces, so we call syncfs() for each of those
+		 * directories.
+		 */
+
+		/* Sync the top level pgdata directory. */
+		do_syncfs(".");
+		/* If any tablespaces are configured, sync each of those. */
+		dir = AllocateDir("pg_tblspc");
+		while ((de = ReadDirExtended(dir, "pg_tblspc", LOG)))
+		{
+			char		path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+				continue;
+
+			snprintf(path, MAXPGPATH, "pg_tblspc/%s", de->d_name);
+			do_syncfs(path);
+		}
+		FreeDir(dir);
+		/* If pg_wal is a symlink, process that too. */
+		if (xlog_is_symlink)
+			do_syncfs("pg_wal");
+		return;
+	}
+#endif		/* !HAVE_SYNCFS */
+
 	/*
 	 * If possible, hint to the kernel that we're soon going to fsync the data
 	 * directory and its contents.  Errors in this step are even less
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 997b4b70ee..8679063e5f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -487,6 +487,14 @@ const struct config_enum_entry ssl_protocol_versions_info[] = {
 StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2),
 				 "array length mismatch");
 
+static struct config_enum_entry recovery_init_sync_method_options[] = {
+	{"fsync", RECOVERY_INIT_SYNC_METHOD_FSYNC, false},
+#ifdef HAVE_SYNCFS
+	{"syncfs", RECOVERY_INIT_SYNC_METHOD_SYNCFS, false},
+#endif
+	{NULL, 0, false}
+};
+
 static struct config_enum_entry shared_memory_options[] = {
 #ifndef WIN32
 	{"sysv", SHMEM_TYPE_SYSV, false},
@@ -4859,6 +4867,15 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+		},
+		&recovery_init_sync_method,
+		RECOVERY_INIT_SYNC_METHOD_FSYNC, recovery_init_sync_method_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3ff507d5f6..76b9e815a2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -760,6 +760,7 @@
 #restart_after_crash = on		# reinitialize after backend crash?
 #remove_temp_files_after_crash = on	# remove temporary files after
 					# backend crash?
+#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+)
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 7a7cc21d8d..7bb5743e3d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -590,6 +590,9 @@
 /* Define to 1 if you have the `symlink' function. */
 #undef HAVE_SYMLINK
 
+/* Define to 1 if you have the `syncfs' function. */
+#undef HAVE_SYNCFS
+
 /* Define to 1 if you have the `sync_file_range' function. */
 #undef HAVE_SYNC_FILE_RANGE
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 30bf7d2193..328473bdc9 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -45,6 +45,11 @@
 
 #include <dirent.h>
 
+typedef enum RecoveryInitSyncMethod {
+	RECOVERY_INIT_SYNC_METHOD_FSYNC,
+	RECOVERY_INIT_SYNC_METHOD_SYNCFS
+} RecoveryInitSyncMethod;
+
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 typedef int File;
@@ -53,6 +58,7 @@ typedef int File;
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
+extern int recovery_init_sync_method;
 
 /*
  * This is private to fd.c, but exported for save/restore_backend_variables()
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index a4f5cc4bdb..43d171e544 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -388,6 +388,7 @@ sub GenerateFiles
 		HAVE_STRUCT_TM_TM_ZONE                   => undef,
 		HAVE_SYNC_FILE_RANGE                     => undef,
 		HAVE_SYMLINK                             => 1,
+		HAVE_SYNCFS                              => undef,
 		HAVE_SYSLOG                              => undef,
 		HAVE_SYS_EPOLL_H                         => undef,
 		HAVE_SYS_EVENT_H                         => undef,
-- 
2.30.1

v6-0002-Provide-recovery_init_sync_method-none.patchtext/x-patch; charset=US-ASCII; name=v6-0002-Provide-recovery_init_sync_method-none.patchDownload

From 77ec7a6a192e227899c80a873c4479c866024587 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 19 Mar 2021 13:06:35 +1300
Subject: [PATCH v6 2/3] Provide recovery_init_sync_method=none.

Although it's rife with hazards and not a good option to use in general,
some expert users may want to be able to disable the initial sync
performed before crash recovery.  Explicitly discouraged in the
documentation.

Discussion https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
---
 doc/src/sgml/config.sgml                      | 9 +++++++++
 src/backend/storage/file/fd.c                 | 9 +++++++--
 src/backend/utils/misc/guc.c                  | 1 +
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 src/include/storage/fd.h                      | 3 ++-
 5 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 32de8235bf..0d058de261 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9748,6 +9748,15 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
         <productname>PostgreSQL</productname>, and relevant error messages may
         appear only in kernel logs.
        </para>
+       <para>
+        The option <literal>none</literal> can be specified to skip this step.
+        This is only safe if all buffered data is known to have been flushed
+        to disk already, for example by a tool such as
+        <application>pg_basebackup</application> without
+        <literal>--no-sync</literal>.  It is not a good idea to
+        leave this option configured permanently in general, as it will
+        affect future crash recovery cycles.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 28933f8bbe..b59fbc4cfe 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3293,8 +3293,9 @@ do_syncfs(const char *path)
 #endif
 
 /*
- * Issue fsync recursively on PGDATA and all its contents, or issue syncfs for
- * all potential filesystem, depending on recovery_init_sync_method setting.
+ * Issue fsync recursively on PGDATA and all its contents, issue syncfs for all
+ * potential filesystem, or do nothing, depending on the
+ * recovery_init_sync_method setting.
  *
  * We fsync regular files and directories wherever they are, but we
  * follow symlinks only for pg_wal and immediately under pg_tblspc.
@@ -3323,6 +3324,10 @@ SyncDataDirectory(void)
 	if (!enableFsync)
 		return;
 
+	/* Likewise if this specific sync step is disabled. */
+	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_NONE)
+		return;
+
 	/*
 	 * If pg_wal is a symlink, we'll need to recurse into it separately,
 	 * because the first walkdir below will ignore it.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8679063e5f..4fa67a5160 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -492,6 +492,7 @@ static struct config_enum_entry recovery_init_sync_method_options[] = {
 #ifdef HAVE_SYNCFS
 	{"syncfs", RECOVERY_INIT_SYNC_METHOD_SYNCFS, false},
 #endif
+	{"none", RECOVERY_INIT_SYNC_METHOD_NONE, false},
 	{NULL, 0, false}
 };
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 76b9e815a2..8b1610f6ec 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -760,7 +760,7 @@
 #restart_after_crash = on		# reinitialize after backend crash?
 #remove_temp_files_after_crash = on	# remove temporary files after
 					# backend crash?
-#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+)
+#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+), none
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 328473bdc9..dca44c38c4 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -47,7 +47,8 @@
 
 typedef enum RecoveryInitSyncMethod {
 	RECOVERY_INIT_SYNC_METHOD_FSYNC,
-	RECOVERY_INIT_SYNC_METHOD_SYNCFS
+	RECOVERY_INIT_SYNC_METHOD_SYNCFS,
+	RECOVERY_INIT_SYNC_METHOD_NONE
 } RecoveryInitSyncMethod;
 
 struct iovec;					/* avoid including port/pg_iovec.h here */
-- 
2.30.1

v6-0003-Provide-recovery_init_sync_method-wal.patchtext/x-patch; charset=US-ASCII; name=v6-0003-Provide-recovery_init_sync_method-wal.patchDownload

From e60b926e12327369e69956cfb4d13b5bb8ad2c3e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 26 Sep 2020 23:07:48 +1200
Subject: [PATCH v6 3/3] Provide recovery_init_sync_method=wal.

It's important to avoid relying on dirty data in the kernel's cache when
deciding that a change has already been applied.

The existing options have disadvantages:

 * recovery_init_sync_method=fsync has to open every file, which
   can be very slow for millions of files on high latency storage.
 * recovery_init_sync_method=syncfs is Linux-only, doesn't report
   errors on older versions and also syncs data outside pgdata
 * recovery_init_sync_method=none requires human action to make
   sure data is synced, and even when that's done correctly, it applies
   to all future crash recovery cycles until the setting is changed

This new option syncs only the WAL files, and then relies on analysis of
the WAL during replay.  We start by assuming that modifications from
before the last checkpoint were already flushed by the checkpoint.  That
is true if you crashed or ran pg_basebackup to the present location, but
false if you made a second copy with non-syncing tools.  Then, if
full_page_writes=on, we expect all data modified since the last
checkpoint to be written to disk as part of the end-of-recovery
checkpoint.  If full_page_writes=off, things are slightly trickier: we
skip updating pages that have an LSN higher than a WAL record
("BLK_DONE"), but we don't trust any data we read from the kernel's
cache to be really on disk, so we skip replay but we still make a note
to sync the file at the next checkpoint anyway.

(A slightly more paranoid mode would mark skipped pages dirty instead,
to trigger a rewrite of all pages we skip, if our threat model includes
operating systems that might mark their own page cache pages "clean"
rather than invalidating them after a writeback failure.  That is not
done in this commit.)

XXX This is proposed for v15.  Need to determine (1) whether there are any
other files under pgdata that need to be explicity fsync'd, (2) whether
it's OK to assume that everything covered by the last checkpoint really
is flushed to disk (in other words, that we're restarting a crashed
server or a pg_basebackup that didn't use the --no-sync option, but not
a copy of a crashed cluster that someone made with rsync/tar/cp -r/...).

Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org
Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  8 ++-
 src/backend/access/transam/xlog.c             | 52 +++++++++++++------
 src/backend/access/transam/xlogutils.c        | 11 ++++
 src/backend/storage/buffer/bufmgr.c           | 20 +++++++
 src/backend/storage/file/fd.c                 | 10 ++--
 src/backend/storage/smgr/md.c                 | 15 ++++++
 src/backend/storage/smgr/smgr.c               | 15 ++++++
 src/backend/utils/misc/guc.c                  |  3 +-
 src/backend/utils/misc/postgresql.conf.sample |  2 +-
 src/include/storage/bufmgr.h                  |  1 +
 src/include/storage/fd.h                      |  1 +
 src/include/storage/md.h                      |  1 +
 src/include/storage/smgr.h                    |  2 +
 13 files changed, 119 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0d058de261..e239916a84 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9729,7 +9729,13 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </term>
       <listitem>
        <para>
-        When set to <literal>fsync</literal>, which is the default,
+        When set to <literal>wal</literal>, which is the default,
+        <productname>PostgreSQL</productname> will fsync all WAL files before
+        recovery begins, and rely on WAL replay to make sure all modified
+        data files are handle correctly.
+       </para>
+       <para>
+        When set to <literal>fsync</literal>,
         <productname>PostgreSQL</productname> will recursively open and fsync
         all files in the data directory before crash recovery begins.  This
         is intended to make sure that all WAL and data files are durably stored
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e149..89678e6470 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -927,7 +927,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
-static void RemoveTempXlogFiles(void);
+static void ScanXlogFilesAfterCrash(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
 static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 						   XLogSegNo *endlogSegNo);
@@ -4030,13 +4030,15 @@ UpdateLastRemovedPtr(char *filename)
 }
 
 /*
- * Remove all temporary log files in pg_wal
+ * Remove all temporary log files in pg_wal, and make sure that all remaining
+ * files are down on disk before we replay anything in them if
+ * recovery_init_sync_method requires that.
  *
  * This is called at the beginning of recovery after a previous crash,
  * at a point where no other processes write fresh WAL data.
  */
 static void
-RemoveTempXlogFiles(void)
+ScanXlogFilesAfterCrash(void)
 {
 	DIR		   *xldir;
 	struct dirent *xlde;
@@ -4048,12 +4050,23 @@ RemoveTempXlogFiles(void)
 	{
 		char		path[MAXPGPATH];
 
-		if (strncmp(xlde->d_name, "xlogtemp.", 9) != 0)
-			continue;
-
 		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
-		unlink(path);
-		elog(DEBUG2, "removed temporary WAL segment \"%s\"", path);
+
+		if (strncmp(xlde->d_name, "xlogtemp.", 9) == 0)
+		{
+			unlink(path);
+			elog(DEBUG2, "removed temporary WAL segment \"%s\"", path);
+		}
+		else if (IsXLogFileName(xlde->d_name) &&
+				 recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
+		{
+			/*
+			 * Make sure that whatever we read from this WAL file is durably on
+			 * disk before replaying in RECOVERY_INIT_SYNC_METHOD_WAL mode.
+			 * Necessary because SyncDataDirectory() won't do that for us.
+			 */
+			fsync_fname(path, false);
+		}
 	}
 	FreeDir(xldir);
 }
@@ -6547,23 +6560,30 @@ StartupXLOG(void)
 	ValidateXLOGDirectoryStructure();
 
 	/*----------
-	 * If we previously crashed, perform a couple of actions:
+	 * If we previously crashed:
 	 *
 	 * - The pg_wal directory may still include some temporary WAL segments
 	 *   used when creating a new segment, so perform some clean up to not
-	 *   bloat this path.  This is done first as there is no point to sync
-	 *   this temporary data.
+	 *   bloat this path, in ScanXlogFilesAfterCrash().
+	 *
+	 * - It's possible that some WAL data had been written but not yet synced.
+	 *   Therefore we have to sync these files before replaying the records
+	 *   they contain.  If recovery_init_wal_sync_method=wal we'll do that
+	 *   in ScanXlogFilesAfterCrash(); otherwise we'll leave it up to
+	 *   SyncDataDirectory().
 	 *
 	 * - There might be data which we had written, intending to fsync it, but
-	 *   which we had not actually fsync'd yet.  Therefore, a power failure in
-	 *   the near future might cause earlier unflushed writes to be lost, even
-	 *   though more recent data written to disk from here on would be
-	 *   persisted.  To avoid that, fsync the entire data directory.
+	 *   which we had not actually fsync'd yet.  If
+	 *   recovery_init_wal_sync_method=wal, then XLogReadBufferForRedo() will
+	 *   compute the set of files that may need fsyncing at the next
+	 *   checkpoint, during recovery.  Otherwise, SyncDataDirectory() will
+	 *   sync the entire pgdata directory up front, to avoid relying on data
+	 *   from the kernel's cache that hasn't reached disk yet.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
 		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
 	{
-		RemoveTempXlogFiles();
+		ScanXlogFilesAfterCrash();
 		SyncDataDirectory();
 	}
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d17d660f46..85303e761e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -401,7 +401,18 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 					LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
 			}
 			if (lsn <= PageGetLSN(BufferGetPage(*buf)))
+			{
+				/*
+				 * The page is from the future.  We must be in crash recovery.
+				 * We don't need to redo the record, but we do need to make
+				 * sure that the image as we have seen it is durably stored on
+				 * disk as part of the next checkpoint, unless that was already
+				 * done by SyncDataDirectory().
+				 */
+				if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
+					RequestBufferSync(*buf);
 				return BLK_DONE;
+			}
 			else
 				return BLK_NEEDS_REDO;
 		}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c9..6393545f0c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1530,6 +1530,26 @@ MarkBufferDirty(Buffer buffer)
 	}
 }
 
+/*
+ * Request that the file containing a buffer be fsynced at the next checkpoint.
+ * This is only used in crash recovery, to make it safe to skip applying WAL on
+ * the basis that the page already has a change.  We want to make sure that the
+ * data we read from the kernel matches what's on disk at the next checkpoint.
+ */
+void
+RequestBufferSync(Buffer buffer)
+{
+	SMgrRelation reln;
+	BufferDesc *bufHdr;
+
+	Assert(BufferIsPinned(buffer));
+	Assert(!BufferIsLocal(buffer));
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+	reln = smgropen(bufHdr->tag.rnode, InvalidBackendId);
+	smgrsync(reln, bufHdr->tag.forkNum, bufHdr->tag.blockNum);
+}
+
 /*
  * ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer()
  *
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b59fbc4cfe..77e1bd68b2 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -336,6 +336,7 @@ static void walkdir(const char *path,
 					void (*action) (const char *fname, bool isdir, int elevel),
 					bool process_symlinks,
 					int elevel);
+
 #ifdef PG_FLUSH_DATA_WORKS
 static void pre_sync_fname(const char *fname, bool isdir, int elevel);
 #endif
@@ -3324,8 +3325,12 @@ SyncDataDirectory(void)
 	if (!enableFsync)
 		return;
 
-	/* Likewise if this specific sync step is disabled. */
-	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_NONE)
+	/*
+	 * Likewise if this specific sync step is disabled, or if we're using WAL
+	 * analysis instead.
+	 */
+	if (recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_NONE ||
+		recovery_init_sync_method == RECOVERY_INIT_SYNC_METHOD_WAL)
 		return;
 
 	/*
@@ -3483,7 +3488,6 @@ walkdir(const char *path,
 		(*action) (path, true, elevel);
 }
 
-
 /*
  * Hint to the OS that it should get ready to fsync() this file.
  *
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1e12cfad8e..d74506be31 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -961,6 +961,21 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 	}
 }
 
+/*
+ *	mdsync() -- Schedule a sync at the next checkpoint.
+ *
+ * This is useful in crash recovery, to ensure that data we've read from the
+ * kernel matches what is on disk before a checkpoint.
+ */
+void
+mdsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	MdfdVec	   *seg;
+
+	seg = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
+	register_dirty_segment(reln, forknum, seg);
+}
+
 /*
  * register_dirty_segment() -- Mark a relation segment as needing fsync
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..6856c40300 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,8 @@ typedef struct f_smgr
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+	void		(*smgr_sync) (SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber blocknum);
 } f_smgr;
 
 static const f_smgr smgrsw[] = {
@@ -79,6 +81,7 @@ static const f_smgr smgrsw[] = {
 		.smgr_read = mdread,
 		.smgr_write = mdwrite,
 		.smgr_writeback = mdwriteback,
+		.smgr_sync = mdsync,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
 		.smgr_immedsync = mdimmedsync,
@@ -540,6 +543,18 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 											nblocks);
 }
 
+
+/*
+ *	smgrsync() -- Request that the file containing a block be flushed at the
+ *				  next checkpoint.
+ */
+void
+smgrsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	smgrsw[reln->smgr_which].smgr_sync(reln, forknum, blocknum);
+}
+
+
 /*
  *	smgrnblocks() -- Calculate the number of blocks in the
  *					 supplied relation.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4fa67a5160..c6d0a2eed3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -488,6 +488,7 @@ StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2)
 				 "array length mismatch");
 
 static struct config_enum_entry recovery_init_sync_method_options[] = {
+	{"wal", RECOVERY_INIT_SYNC_METHOD_WAL, false},
 	{"fsync", RECOVERY_INIT_SYNC_METHOD_FSYNC, false},
 #ifdef HAVE_SYNCFS
 	{"syncfs", RECOVERY_INIT_SYNC_METHOD_SYNCFS, false},
@@ -4873,7 +4874,7 @@ static struct config_enum ConfigureNamesEnum[] =
 			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
 		},
 		&recovery_init_sync_method,
-		RECOVERY_INIT_SYNC_METHOD_FSYNC, recovery_init_sync_method_options,
+		RECOVERY_INIT_SYNC_METHOD_WAL, recovery_init_sync_method_options,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8b1610f6ec..96107d641e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -760,7 +760,7 @@
 #restart_after_crash = on		# reinitialize after backend crash?
 #remove_temp_files_after_crash = on	# remove temporary files after
 					# backend crash?
-#recovery_init_sync_method = fsync	# fsync, syncfs (Linux 5.8+), none
+#recovery_init_sync_method = wal	# wal, fsync, syncfs (Linux 5.8+), none
 #data_sync_retry = off			# retry or panic on failure to fsync
 					# data?
 					# (change requires restart)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fb00fda6a7..f5409264da 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -186,6 +186,7 @@ extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
+extern void RequestBufferSync(Buffer buffer);
 extern void IncrBufferRefCount(Buffer buffer);
 extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
 								   BlockNumber blockNum);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index dca44c38c4..801b2a4601 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -46,6 +46,7 @@
 #include <dirent.h>
 
 typedef enum RecoveryInitSyncMethod {
+	RECOVERY_INIT_SYNC_METHOD_WAL,
 	RECOVERY_INIT_SYNC_METHOD_FSYNC,
 	RECOVERY_INIT_SYNC_METHOD_SYNCFS,
 	RECOVERY_INIT_SYNC_METHOD_NONE
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..35e813410a 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -40,6 +40,7 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void mdsync(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..4a9cc9f181 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -103,6 +103,8 @@ extern BlockNumber smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
 						 int nforks, BlockNumber *nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrsync(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum);
 extern void AtEOXact_SMgr(void);
 
 #endif							/* SMGR_H */
-- 
2.30.1

#34

Fujii Masao

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Thomas Munro (#33)

Re: fdatasync performance problem with large number of DB files

On 2021/03/19 14:28, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for updating the patch! It looks good to me!
I have one minor comment for the patch.
+               elog(LOG, "could not open %s: %m", path);
+               return;
+       }
+       if (syncfs(fd) < 0)
+               elog(LOG, "could not sync filesystem for \"%s\": %m", path);
Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()? For example,
Fixed.

Thanks! LGTM.

I'll let this sit until tomorrow to collect any other feedback or
objections, and then push the 0001 patch
(recovery_init_sync_method=syncfs).

Understood.

On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

0002 patch looks good to me. Thanks!
I have minor comments.

Ok, I made the changes you suggested.

Thanks! LGTM.

Let's see if anyone else would
like to vote for or against the concept of the 0002 patch
(recovery_init_sync_method=none).

Agreed. I also want to hear more opinion about the setting "none".

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#35

Justin Pryzby

pryzby@telsasoft.com

almost 5 years ago

In reply to: Thomas Munro (#33)

Re: fdatasync performance problem with large number of DB files

On Fri, Mar 19, 2021 at 06:28:46PM +1300, Thomas Munro wrote:

+++ b/doc/src/sgml/config.sgml

+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.  This

Maybe it should say "data, tablespace, and WAL directories", or just "critical
directories".

+	{
+		{"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+			gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+		},

"and tablespaces and WAL"

--
Justin

#36

David Steele

david@pgmasters.net

almost 5 years ago

In reply to: Thomas Munro (#33)

Re: fdatasync performance problem with large number of DB files

On 3/19/21 1:28 AM, Thomas Munro wrote:

On Fri, Mar 19, 2021 at 3:23 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
Thanks for updating the patch! It looks good to me!
I have one minor comment for the patch.
+               elog(LOG, "could not open %s: %m", path);
+               return;
+       }
+       if (syncfs(fd) < 0)
+               elog(LOG, "could not sync filesystem for \"%s\": %m", path);
Since these are neither internal errors nor low-level debug messages, ereport() should be used for them rather than elog()? For example,
Fixed.

I'll let this sit until tomorrow to collect any other feedback or
objections, and then push the 0001 patch
(recovery_init_sync_method=syncfs).

I had a look at the patch and it looks good to me.

Should we mention in the docs that the contents of non-standard symlinks
will not be synced, i.e. anything other than tablespaces and pg_wal? Or
can we point them to some docs saying not to do that (if such exists)?

On Fri, Mar 19, 2021 at 4:08 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

0002 patch looks good to me. Thanks!
I have minor comments.

Ok, I made the changes you suggested. Let's see if anyone else would
like to vote for or against the concept of the 0002 patch
(recovery_init_sync_method=none).

It worries me that this needs to be explicitly "turned off" after the
initial recovery. Seems like something of a foot gun.

Since we have not offered this functionality before I'm not sure we
should rush to introduce it now. For backup solutions that do their own
syncing, syncfs() should provide excellent performance so long as the
file system is not shared, which is something the user can control (and
is noted in the docs).

Regards,
--
-David
david@pgmasters.net

#37

Thomas Munro

thomas.munro@gmail.com

almost 5 years ago

In reply to: David Steele (#36)

Re: fdatasync performance problem with large number of DB files

Thanks Justin and David. Replies to two emails inline:

On Sat, Mar 20, 2021 at 12:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Fri, Mar 19, 2021 at 06:28:46PM +1300, Thomas Munro wrote:
+++ b/doc/src/sgml/config.sgml
+        <productname>PostgreSQL</productname> will recursively open and fsync
+        all files in the data directory before crash recovery begins.  This
Maybe it should say "data, tablespace, and WAL directories", or just "critical
directories".

Fair point. Here's what I went with:

When set to <literal>fsync</literal>, which is the default,
<productname>PostgreSQL</productname> will recursively open and
synchronize all files in the data directory before crash
recovery
begins. The search for files will follow symbolic links for the WAL
directory and each configured tablespace (but not any other symbolic
links).

+     {
+             {"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+                     gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+             },

"and tablespaces and WAL"

I feel like that's getting too detailed for the GUC description?

On Sat, Mar 20, 2021 at 2:55 AM David Steele <david@pgmasters.net> wrote:

I had a look at the patch and it looks good to me.

Thanks!

Should we mention in the docs that the contents of non-standard symlinks
will not be synced, i.e. anything other than tablespaces and pg_wal? Or
can we point them to some docs saying not to do that (if such exists)?

Good idea. See above for the adjustment I went with to describe the
traditional behaviour, and then I also made a similar change to the
syncfs part, which, I hope, manages to convey that the new mode
matches the existing policy on symlinks:

On Linux, <literal>syncfs</literal> may be used instead, to ask the
operating system to synchronize the whole file systems that contain the
data directory, the WAL file and each tablespace (but not any other
file systems that may be reachable through symbolic links).

I thought about adding some text along the lines that such symlinks
are not expected, but I think you're right that what we really need is
a good place to point to. I mean, generally you can't mess around
with the files managed by PostgreSQL and expect everything to keep
working correctly, but it wouldn't hurt to make an explicit statement
about symlinks and where they're allowed (or maybe there is one
already and I failed to find it). There are hints though, like
pg_basebackup's documentation which tells you it won't follow or
preserve them in general, but... hmm, it also contemplates various
special subdirectories (pg_dynshmem, pg_notify, pg_replslot, ...) that
might be symlinks without saying why.

Ok, I made the changes you suggested. Let's see if anyone else would
like to vote for or against the concept of the 0002 patch
(recovery_init_sync_method=none).

It worries me that this needs to be explicitly "turned off" after the
initial recovery. Seems like something of a foot gun.

Since we have not offered this functionality before I'm not sure we
should rush to introduce it now. For backup solutions that do their own
syncing, syncfs() should provide excellent performance so long as the
file system is not shared, which is something the user can control (and
is noted in the docs).

Thanks. I'm leaving the 0002 patch "on ice" until someone can explain
how you're supposed to use it without putting a hole in your foot.

I pushed the 0001 patch. Thanks to all who reviewed. Of course,
further documentation improvement patches are always welcome.

(One silly thing I noticed is that our comments generally think
"filesystem" is one word, but our documentation always has a space;
this patch followed the local convention in both cases!)

#38

David Steele

david@pgmasters.net

almost 5 years ago

In reply to: Thomas Munro (#37)

Re: fdatasync performance problem with large number of DB files

On 3/19/21 7:16 PM, Thomas Munro wrote:

Thanks Justin and David. Replies to two emails inline:

Fair point. Here's what I went with:

When set to <literal>fsync</literal>, which is the default,
<productname>PostgreSQL</productname> will recursively open and
synchronize all files in the data directory before crash
recovery
begins. The search for files will follow symbolic links for the WAL
directory and each configured tablespace (but not any other symbolic
links).

I thought about adding some text along the lines that such symlinks
are not expected, but I think you're right that what we really need is
a good place to point to. I mean, generally you can't mess around
with the files managed by PostgreSQL and expect everything to keep
working correctly

WRT to symlinks I'm not sure that's fair to say. From PG's perspective
it's just a dir/file after all. Other than pg_wal I have seen
pg_stat/pg_stat_tmp sometimes symlinked, plus config files, and the log dir.

pgBackRest takes a pretty liberal approach here. Were preserve all
dir/file symlinks no matter where they appear and allow all of them to
be remapped on restore.

but it wouldn't hurt to make an explicit statement
about symlinks and where they're allowed (or maybe there is one
already and I failed to find it).

I couldn't find it either and I would be in favor of it. For instance,
pgBackRest forbids tablespaces inside PGDATA and when people complain
(more often then you might imagine) we can just point to the code/docs.

There are hints though, like
pg_basebackup's documentation which tells you it won't follow or
preserve them in general, but... hmm, it also contemplates various
special subdirectories (pg_dynshmem, pg_notify, pg_replslot, ...) that
might be symlinks without saying why.

Right, pg_dynshmem is another one that I've seen symlinked. Some things
are nice to have on fast storage. pg_notify and pg_replslot are similar
since they get written to a lot in certain configurations.

It worries me that this needs to be explicitly "turned off" after the
initial recovery. Seems like something of a foot gun.

Since we have not offered this functionality before I'm not sure we
should rush to introduce it now. For backup solutions that do their own
syncing, syncfs() should provide excellent performance so long as the
file system is not shared, which is something the user can control (and
is noted in the docs).

Thanks. I'm leaving the 0002 patch "on ice" until someone can explain
how you're supposed to use it without putting a hole in your foot.

(One silly thing I noticed is that our comments generally think
"filesystem" is one word, but our documentation always has a space;
this patch followed the local convention in both cases!)

Personally I prefer "file system".

Regards,
--
-David
david@pgmasters.net

#39

Greg Stark

stark@mit.edu

almost 5 years ago

In reply to: Tom Lane (#8)

Re: fdatasync performance problem with large number of DB files

On Wed, 10 Mar 2021 at 20:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So this means that in less-than-bleeding-edge kernels, syncfs can
only be regarded as a dangerous toy. If we expose an option to use
it, there had better be large blinking warnings in the docs.

Isn't that true for fsync and everything else related on
less-than-bleeding-edge kernels anyways?

--
greg

#40

Justin Pryzby

pryzby@telsasoft.com

over 4 years ago

In reply to: Thomas Munro (#37)

Re: fdatasync performance problem with large number of DB files

On Sat, Mar 20, 2021 at 12:16:27PM +1300, Thomas Munro wrote:

+     {
+             {"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+                     gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+             },

Is there any reason why this can't be PGC_SIGHUP ?
(Same as restart_after_crash, remove_temp_files_after_crash)

As neat as it'd be, I am not expecting the recovery process to reload the
configuration and finish fast if I send it HUP.

While I'm looking, it's not clear why this needs to be PGC_POSTMASTER.
data_sync_retry - but see b3a156858

This one isn't documented as requiring a restart:
max_logical_replication_workers.

ignore_invalid_pages could probably be SIGHUP, but it's intended to be used as
a commandline option, not in a config file.

--
Justin

#41

Michael Paquier

michael@paquier.xyz

over 4 years ago

In reply to: Justin Pryzby (#40)

Re: fdatasync performance problem with large number of DB files

On Tue, May 25, 2021 at 07:13:59PM -0500, Justin Pryzby wrote:

This one isn't documented as requiring a restart:
max_logical_replication_workers.

There is much more than meets the eye here, and this is unrelated to
this thread, so let's discuss that on a separate thread. I'll start a
new one with everything I found.
--
Michael

#42

Justin Pryzby

pryzby@telsasoft.com

over 4 years ago

In reply to: Justin Pryzby (#40)

Re: fdatasync performance problem with large number of DB files

On Tue, May 25, 2021 at 07:13:59PM -0500, Justin Pryzby wrote:

On Sat, Mar 20, 2021 at 12:16:27PM +1300, Thomas Munro wrote:
+     {
+             {"recovery_init_sync_method", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
+                     gettext_noop("Sets the method for synchronizing the data directory before crash recovery."),
+             },
Is there any reason why this can't be PGC_SIGHUP ?
(Same as restart_after_crash, remove_temp_files_after_crash)

I can't see any reason why this is nontrivial.
What about data_sync_retry?

commit 2d2d2e10f99548c486b50a1ce095437d558e8649
Author: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sat May 29 13:41:14 2021 -0500

Change recovery_init_sync_method to PGC_SIGHUP..

The setting has no effect except during startup.
But it's nice to be able to change the setting dynamically, which is expected
to be pretty useful to an admin following crash recovery when turning the
service off and on again is not so appealing.