Volatile write caches on macOS and Windows, redux
Hi,
Continuing a topic from earlier threads[1]/messages/by-id/CA+hUKGJZJVO=iX+eb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ@mail.gmail.com[2]/messages/by-id/20221123014224.xisi44byq3cf5psi@awork3.anarazel.de, I've been wondering
about how to de-klugify wal_sync_method=fsync_writethrough (a setting
that actually affects much more than just WAL), and how to do the
right thing for our users on macOS and Windows by default. Commit
d0c28601 was a very small cleanup in this area. Here are some bigger
ideas I'd like to try out.
Short version:
* Make wal_sync_method=fdatasync the default everywhere
* Drop wal_sync_method=fsync_writethrough
* Add new macOS-only level for the fsync GUC: fsync=full
* Make fsync=full redirect both pg_fsync() and pg_fdatasync()
* Make fsync=full the default on macOS
Motivation:
I think expectations might have changed quite a bit since ~2000. Back
then, fsync() didn't flush write caches on any OS (you were supposed
to use battery-backed controllers and SCSI as found on expensive
proprietary Unix systems if you were serious, IDE/ATA protocols didn't
originally have flush commands, and some consumer drives famously
ignored them or lied, so users of cheap drives were advised to turn
write caches off). Around 2005, Linux decided to start sending the
flush command in fsync(). Windows' FlushFileBuffers() does the same,
and I gathered from Raymond Chen's blog that by the Windows 8
timeframe all consumer drive vendors supported and respected the flush
command. macOS *still* doesn't send it for fsync(), but has had
fcntl(F_FULLFSYNC) since 2003. In Apple's defence, they seem to have
been ahead of the curve on this problem[3]https://lists.apple.com/archives/darwin-dev/2005/Feb/msg00087.html... I suppose they didn't
anticipate that everyone else was going to do it in their main
fsync()/fdatasync() call, they blazed their own trail, and now it all
looks a bit weird.
In other words, back then all systems running PostgreSQL risked data
loss unless you had fancy hardware or turned off unsafe caching. But
now, due to the changing landscape and our policy choices, that is
true only for rarer systems by default while most in our community are
on Linux where this is all just a historical footnote. People's
baseline expectations have moved, and although we try to document the
situation, they are occasionally very surprised: "Loaded footgun
open_datasync on Windows" was Laurenz Albe's reaction[4]/messages/by-id/1527846213.2475.31.camel@cybertec.at to those
paragraphs. Surely we should be able to recover after power loss by
default even on a lowly desktop PC or basic server loaded with SATA
drives, out of the box?
Proposal for Windows:
The existing default use of FILE_FLAG_WRITE_THROUGH is probably a
better choice on hardware where it works reliably (cache disabled,
non-volatile cache, or working FUA support), since it skips a system
call and doesn't wait for incidental other stuff in the cache to
flush, but it's well documented that Windows' SATA drivers neither
pass the "FUA" flag down to the device nor fall back to sending a full
cache flush command. It's also easy to see in the pg_test_fsync
numbers, which are too good to be true on consumer gear. Therefore
wal_sync_method=fdatasync is a better default level. We map that to
NtFlushBuffersFileEx(FLUSH_FLAGS_FILE_DATA_SYNC_ONLY). (The "SYNC" in
that flag name means flush the drive cache; the "DATA...ONLY" in that
flag name means skip non-essential stuff like file modification time
etc just like fdatasync() in POSIX, and goes visibly faster thanks to
not journaling metadata.)
Proposal for macOS:
Our current default isn't nice to users who run a database on
mains-powered Macs. I don't have one myself to try it, but "man
fsync" clearly states that you can lose data and it is easily
demonstrated with a traditional cord-yanking test[5]https://news.ycombinator.com/item?id=30372194. You could
certainly lose some recent commits; you could probably also get more
subtle corruption or a total recovery failure like [6]/messages/by-id/18009-40a42f84af3fbda1@postgresql.org too, if for
example the control file can make it to durable storage and while
pointing to a checkpoint that did not (maybe a ZFS-like atomic
root-switch prevents that sort of disorder in APFS, I dunno, but I
read some semi-informed speculation that it doesn't work that way
*shrug*).
We do currently offer a non-default setting
wal_sync_method=fsync_writethough to address all this already.
Despite its name, it affects every caller of pg_fsync() (control file,
data files, etc). It's certainly essential to flush all those files
fully too as part of our recovery protocol, but they're not "WAL".
The new idea here is to provide a separate way of controlling that
global behaviour, and I propose fsync=full. Furthermore, I think that
setting should also affect pg_fdatasync(), given that Apple doesn't
even really have fdatasync() (perhaps if they carry out their threat
to implement it, they'll also invent F_FULLFDATASYNC; for now it
*seems* to be basically just another name for fsync() albeit
undeclared by <unistd.h>).
It's possible that fcntl(F_FULLFSYNC) might fail with ENOSUPP or other
errors in obscure cases (eg unusual file systems). In that case, you
could manually lower fsync to just "on" and do your own research on
whether power loss can toast your database, but that doesn't seem like
a reason for us not to ship good solid defaults for typical users.
Rationale for changing wal_sync_method globally (for now):
With wal_sync_method=fdatasync as default for Linux, FreeBSD, OpenBSD,
DragonflyBSD already, if we added macOS and Windows, that'd leave only
NetBSD, AIX, Solaris/illumos. I don't like having different and more
magical defaults on rare target OSes with no expert users left in our
community (as [6]/messages/by-id/18009-40a42f84af3fbda1@postgresql.org reminded me), so I figure we'd be better off with
the same less magical setting everywhere, as a baseline.
Later we might want a per-platform default again. For example, Linux
(like Windows) has policies on whether to believe FUA works reliably
for the purposes of O_DSYNC, but (unlike Windows) falls back to
sending cache flushes instead of doing nothing, so in theory
open_datasync might be a safe and sometimes better performing default
there. If we decided to do that, we'd just restore the
PLATFORM_DEFAULT_SYNC_METHOD mechanism.
The only other OS where I have detailed enough knowledge to comment is
FreeBSD. Its ZFS flushes caches for all levels just fine, so it
doesn't much matter, while its UFS never got that memo (so it's like a
Mac and probably other old Unixes; maybe I'll get that fixed, see
FreeBSD proposal D36371 if interested). The reasons for using
fdatasync on both FreeBSD and Linux wasn't cache control policies, but
rather some obscure logic of ours that would turn on O_DIRECT in some
cases (and I think in the past when wal_level was lower by default, it
would have been common), which might have complications or fail. The
last trace of that is gone since d4e71df6, so if we were to put Linux
on a 'known-good-for-open_datasync' list I'd probably also consider
putting FreeBSD on the list too.
Note that while this'll slow down some real world databases by being
more careful, 'meson test' time shouldn't be affected on any OS due to
use of fsync=off in tests.
Draft patches attached.
[1]: /messages/by-id/CA+hUKGJZJVO=iX+eb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ@mail.gmail.com
[2]: /messages/by-id/20221123014224.xisi44byq3cf5psi@awork3.anarazel.de
[3]: https://lists.apple.com/archives/darwin-dev/2005/Feb/msg00087.html
[4]: /messages/by-id/1527846213.2475.31.camel@cybertec.at
[5]: https://news.ycombinator.com/item?id=30372194
[6]: /messages/by-id/18009-40a42f84af3fbda1@postgresql.org
Attachments:
0001-Make-wal_sync_method-fdatasync-the-default-on-all-OS.patchtext/x-patch; charset=US-ASCII; name=0001-Make-wal_sync_method-fdatasync-the-default-on-all-OS.patchDownload
From 31517fe5e69225aaffb6b9a49246b45ddad8e028 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 15 Jul 2023 11:41:52 +1200
Subject: [PATCH 1/2] Make wal_sync_method=fdatasync the default on all OSes.
Previously, fdatasync was the default level on Linux and FreeBSD by
special rules, and on OpenBSD and DragonflyBSD because they didn't have
O_DSYNC != O_SYNC or O_DSYNC at all. For every other system, we'd
choose open_datasync.
Use fdatasync everywhere, for consistency. This became possible after
commit 9430fb40 added support for Windows, the last known system not to
have it. This means that we'll now flush caches on consumer drives by
default on Windows, where previously we didn't, which seems like a
better default for crash safety. Users who want the older behavior can
still request it with wal_sync_method=open_datasync.
It's entirely likely that we'll reintroduce per-platform choices in
future, but this commit reverses our previous assumption that
open_datasync is the best choice unless we hear otherwise.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
doc/src/sgml/config.sgml | 5 ++---
src/backend/utils/misc/postgresql.conf.sample | 9 ++-------
src/bin/pg_test_fsync/pg_test_fsync.c | 4 ++--
src/include/access/xlogdefs.h | 14 +-------------
src/include/port/freebsd.h | 7 -------
src/include/port/linux.h | 8 --------
6 files changed, 7 insertions(+), 40 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c50c28546d..acb6666e0b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3156,9 +3156,8 @@ include_dir 'conf.d'
</itemizedlist>
<para>
Not all of these choices are available on all platforms.
- The default is the first method in the above list that is supported
- by the platform, except that <literal>fdatasync</literal> is the default on
- Linux and FreeBSD. The default is not necessarily ideal; it might be
+ The default is <literal>fdatasync</literal>.
+ The default is not necessarily ideal; it might be
necessary to change this setting or other aspects of your system
configuration in order to create a crash-safe configuration or
achieve optimal performance.
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e4c0269fa3..d466143e4a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -215,13 +215,8 @@
# unrecoverable data corruption)
#synchronous_commit = on # synchronization level;
# off, local, remote_write, remote_apply, or on
-#wal_sync_method = fsync # the default is the first option
- # supported by the operating system:
- # open_datasync
- # fdatasync (default on Linux and FreeBSD)
- # fsync
- # fsync_writethrough
- # open_sync
+#wal_sync_method = fdatasync # fsync, fdatasync, fsync_writethrough,
+ # open_sync, open_datasync
#full_page_writes = on # recover from partial page writes
#wal_log_hints = off # also do full page writes of non-critical updates
# (change requires restart)
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 14fa4acae2..5b431d2f99 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -292,7 +292,7 @@ test_sync(int writes_per_op)
printf(_("\nCompare file sync methods using one %dkB write:\n"), XLOG_BLCKSZ_K);
else
printf(_("\nCompare file sync methods using two %dkB writes:\n"), XLOG_BLCKSZ_K);
- printf(_("(in wal_sync_method preference order, except fdatasync is Linux's default)\n"));
+ printf(_("(fdatasync is the default)\n"));
/*
* Test open_datasync if available
@@ -326,7 +326,7 @@ test_sync(int writes_per_op)
#endif
/*
- * Test fdatasync if available
+ * Test fdatasync
*/
printf(LABEL_FORMAT, "fdatasync");
fflush(stdout);
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index fe794c7740..a628976902 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -64,19 +64,7 @@ typedef uint32 TimeLineID;
*/
typedef uint16 RepOriginId;
-/*
- * This chunk of hackery attempts to determine which file sync methods
- * are available on the current platform, and to choose an appropriate
- * default method.
- *
- * Note that we define our own O_DSYNC on Windows, but not O_SYNC.
- */
-#if defined(PLATFORM_DEFAULT_SYNC_METHOD)
-#define DEFAULT_SYNC_METHOD PLATFORM_DEFAULT_SYNC_METHOD
-#elif defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
-#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN_DSYNC
-#else
+/* Default synchronization method for WAL. */
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
-#endif
#endif /* XLOG_DEFS_H */
diff --git a/src/include/port/freebsd.h b/src/include/port/freebsd.h
index 0e3fde55d6..2e36d3da4f 100644
--- a/src/include/port/freebsd.h
+++ b/src/include/port/freebsd.h
@@ -1,8 +1 @@
/* src/include/port/freebsd.h */
-
-/*
- * Set the default wal_sync_method to fdatasync. xlogdefs.h's normal rules
- * would prefer open_datasync on FreeBSD 13+, but that is not a good choice on
- * many systems.
- */
-#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
diff --git a/src/include/port/linux.h b/src/include/port/linux.h
index 7a6e46cdbb..acd867606c 100644
--- a/src/include/port/linux.h
+++ b/src/include/port/linux.h
@@ -12,11 +12,3 @@
* to have a kernel version test here.
*/
#define HAVE_LINUX_EIDRM_BUG
-
-/*
- * Set the default wal_sync_method to fdatasync. With recent Linux versions,
- * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
- * perform better and (b) causes outright failures on ext4 data=journal
- * filesystems, because those don't support O_DIRECT.
- */
-#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
--
2.39.2
0002-Remove-fsync_writethrough-add-fsync-full-macOS-only.patchtext/x-patch; charset=US-ASCII; name=0002-Remove-fsync_writethrough-add-fsync-full-macOS-only.patchDownload
From 761792297fffbcabf356cdcb5471845845e54ca7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 14 Jul 2023 21:51:44 +1200
Subject: [PATCH 2/2] Remove fsync_writethrough, add fsync=full (macOS only).
Previously, wal_sync_method=fsync_writethrough affected pg_fsync() of
WAL, control, data and other files, by causing pg_fsync() to use the
fcntl(F_FULLFSYNC) call that Apple recommends for databases.
That's a little confusing, because the GUC purports to affect only WAL
files, and it was also not used by default, exposing users to data loss
risk. New approach:
1. wal_sync_method=fsync_writethrough is removed. Instead, we will
make the existing fsync and fdatasync methods do what Apple recommends.
2. Introduce a new setting fsync=full. This refactoring reflects the
fact that all pg_fsync() calls are affected, not just those for WAL
files. According to expectations established by other modern platforms,
fcntl(F_FULLFSYNC) is the "true" fsync operation on this platform, and
fsync=full will select that. The traditional fsync() system call is
still available by setting fsync=on.
3. Use fcntl(F_FULLSYNC) for pg_fdatasync() too, if fsync=full. Apple
doesn't exactly implement fdatasync(). It's not documented, but exists
in some partially implemented form, without a declaration. As far as we
can tell, it behaves just like fsync(). We'll still use it if you ask
for it with fsync=on.
4. Use fsync=full by default on Macs, following Apple's recommendation
for applications such as databases. It is not available on non-Macs, to
avoid confusing people on other operating systems. Since
wal_sync_method's default was changed to fdatasync by an earlier commit,
the net effect is that Macs now use F_FULLSYNC for all file
synchronization (data, control, WAL, ...) by default.
While here, also get rid of configure probes for F_FULLFSYNC, and just
test for its existence with #ifdef.
---
doc/src/sgml/config.sgml | 31 ++++++--
doc/src/sgml/monitoring.sgml | 8 +-
doc/src/sgml/wal.sgml | 14 +---
meson.build | 1 -
src/backend/access/transam/xlog.c | 18 +----
src/backend/storage/file/fd.c | 74 +++++++------------
src/backend/storage/sync/sync.c | 2 +-
src/backend/utils/init/globals.c | 2 +-
src/backend/utils/misc/guc_tables.c | 41 +++++++---
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/bin/pg_test_fsync/pg_test_fsync.c | 33 ++-------
src/include/access/xlog.h | 3 +-
src/include/miscadmin.h | 15 +++-
src/include/port/darwin.h | 5 --
src/include/storage/fd.h | 2 -
src/tools/msvc/Solution.pm | 1 -
16 files changed, 118 insertions(+), 134 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index acb6666e0b..b34d3901f2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2898,7 +2898,7 @@ include_dir 'conf.d'
</varlistentry>
<varlistentry id="guc-fsync" xreflabel="fsync">
- <term><varname>fsync</varname> (<type>boolean</type>)
+ <term><varname>fsync</varname> (<type>enum</type>)
<indexterm>
<primary><varname>fsync</varname> configuration parameter</primary>
</indexterm>
@@ -2913,6 +2913,22 @@ include_dir 'conf.d'
consistent state after an operating system or hardware crash.
</para>
+ <para>
+ On macOS only, the parameter can also be set to
+ <literal>full</literal>, and that is the default.
+ This value causes <productname>PostgreSQL</productname> to replace all
+ calls to <function>fsync()</function> with
+ <function>fcntl(fd, F_FULLFSYNC)</function>, as recommended by
+ Apple's documentation. It also does the same for calls to
+ <function>fdatasync()</function>, a function that exists but is not
+ documented by Apple.
+ Setting it to <literal>on</literal> causes macOS's regular
+ <function>fsync()</function> and <function>fdatasync()</function>
+ functions to be used as on other platforms. They are known to be
+ considerably faster, but users should consult the warnings in Apple's
+ manual page for <function>fsync</function> first.
+ </para>
+
<para>
While turning off <varname>fsync</varname> is often a performance
benefit, this can result in unrecoverable data corruption in
@@ -3144,11 +3160,6 @@ include_dir 'conf.d'
</para>
</listitem>
<listitem>
- <para>
- <literal>fsync_writethrough</literal> (call <function>fsync()</function> at each commit, forcing write-through of any disk write cache)
- </para>
- </listitem>
- <listitem>
<para>
<literal>open_sync</literal> (write WAL files with <function>open()</function> option <symbol>O_SYNC</symbol>)
</para>
@@ -3165,6 +3176,14 @@ include_dir 'conf.d'
This parameter can only be set in the <filename>postgresql.conf</filename>
file or on the server command line.
</para>
+ <para>
+ Note that on macOS, the <xref linkend="guc-fsync"/> setting can modify
+ the behavior of the <literal>fdatasync</literal> and
+ <literal>fsync</literal> levels, since
+ it can cause all calls to <function>fdatasync()</function> and
+ <function>fsync()</function> to be replaced with an Apple-specific
+ <function>fcntl()</function> call.
+ </para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 588b720f57..87dd026bc9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3055,8 +3055,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<function>issue_xlog_fsync</function> request
(if <xref linkend="guc-fsync"/> is <literal>on</literal> and
<xref linkend="guc-wal-sync-method"/> is either
- <literal>fdatasync</literal>, <literal>fsync</literal> or
- <literal>fsync_writethrough</literal>, otherwise zero).
+ <literal>fdatasync</literal> or <literal>fsync</literal>,
+ otherwise zero).
See <xref linkend="wal-configuration"/> for more information about
the internal WAL function <function>issue_xlog_fsync</function>.
</para></entry>
@@ -3086,8 +3086,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
(if <varname>track_wal_io_timing</varname> is enabled,
<varname>fsync</varname> is <literal>on</literal>, and
<varname>wal_sync_method</varname> is either
- <literal>fdatasync</literal>, <literal>fsync</literal> or
- <literal>fsync_writethrough</literal>, otherwise zero).
+ <literal>fdatasync</literal> or <literal>fsync</literal>,
+ otherwise zero).
</para></entry>
</row>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 4aad0e1a07..342fc4c6c8 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -114,12 +114,6 @@
</para>
</listitem>
- <listitem>
- <para>
- On <productname>macOS</productname>, write caching can be prevented by
- setting <varname>wal_sync_method</varname> to <literal>fsync_writethrough</literal>.
- </para>
- </listitem>
</itemizedlist>
<para>
@@ -758,10 +752,6 @@
The <xref linkend="guc-wal-sync-method"/> parameter determines how
<productname>PostgreSQL</productname> will ask the kernel to force
<acronym>WAL</acronym> updates out to disk.
- All the options should be the same in terms of reliability, with
- the exception of <literal>fsync_writethrough</literal>, which can sometimes
- force a flush of the disk cache even when other options do not do so.
- However, it's quite platform-specific which one will be the fastest.
You can test the speeds of different options using the <xref
linkend="pgtestfsync"/> program.
Note that this parameter is irrelevant if <varname>fsync</varname>
@@ -795,8 +785,8 @@
<literal>open_datasync</literal> or <literal>open_sync</literal>,
a write operation in <function>XLogWrite</function> guarantees to sync written
WAL data to disk and <function>issue_xlog_fsync</function> does nothing.
- If <varname>wal_sync_method</varname> is either <literal>fdatasync</literal>,
- <literal>fsync</literal>, or <literal>fsync_writethrough</literal>,
+ If <varname>wal_sync_method</varname> is either <literal>fdatasync</literal>
+ or <literal>fsync</literal>,
the write operation moves WAL buffers to kernel cache and
<function>issue_xlog_fsync</function> syncs them to disk. Regardless
of the setting of <varname>track_wal_io_timing</varname>, the number
diff --git a/meson.build b/meson.build
index 04ea348852..c37332b4d5 100644
--- a/meson.build
+++ b/meson.build
@@ -2163,7 +2163,6 @@ endforeach
decl_checks = [
- ['F_FULLFSYNC', 'fcntl.h'],
['fdatasync', 'unistd.h'],
['posix_fadvise', 'fcntl.h'],
['strlcat', 'string.h'],
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8b0710abe6..d7241766de 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -173,9 +173,6 @@ static bool check_wal_consistency_checking_deferred = false;
*/
const struct config_enum_entry sync_method_options[] = {
{"fsync", SYNC_METHOD_FSYNC, false},
-#ifdef HAVE_FSYNC_WRITETHROUGH
- {"fsync_writethrough", SYNC_METHOD_FSYNC_WRITETHROUGH, false},
-#endif
{"fdatasync", SYNC_METHOD_FDATASYNC, false},
#ifdef O_SYNC
{"open_sync", SYNC_METHOD_OPEN, false},
@@ -2613,7 +2610,7 @@ XLogFlush(XLogRecPtr record)
* We do not sleep if enableFsync is not turned on, nor if there are
* fewer than CommitSiblings other backends with active transactions.
*/
- if (CommitDelay > 0 && enableFsync &&
+ if (CommitDelay > 0 && enableFsync != ENABLE_FSYNC_OFF &&
MinimumActiveBackends(CommitSiblings))
{
pg_usleep(CommitDelay);
@@ -8084,7 +8081,7 @@ get_sync_bit(int method)
o_direct_flag = PG_O_DIRECT;
/* If fsync is disabled, never open in sync mode */
- if (!enableFsync)
+ if (enableFsync == ENABLE_FSYNC_OFF)
return o_direct_flag;
switch (method)
@@ -8096,7 +8093,6 @@ get_sync_bit(int method)
* be seen here.
*/
case SYNC_METHOD_FSYNC:
- case SYNC_METHOD_FSYNC_WRITETHROUGH:
case SYNC_METHOD_FDATASYNC:
return o_direct_flag;
#ifdef O_SYNC
@@ -8171,7 +8167,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
* Quick exit if fsync is disabled or write() has already synced the WAL
* file.
*/
- if (!enableFsync ||
+ if (enableFsync == ENABLE_FSYNC_OFF ||
sync_method == SYNC_METHOD_OPEN ||
sync_method == SYNC_METHOD_OPEN_DSYNC)
return;
@@ -8186,15 +8182,9 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
- if (pg_fsync_no_writethrough(fd) != 0)
+ if (pg_fsync(fd) != 0)
msg = _("could not fsync file \"%s\": %m");
break;
-#ifdef HAVE_FSYNC_WRITETHROUGH
- case SYNC_METHOD_FSYNC_WRITETHROUGH:
- if (pg_fsync_writethrough(fd) != 0)
- msg = _("could not fsync write-through file \"%s\": %m");
- break;
-#endif
case SYNC_METHOD_FDATASYNC:
if (pg_fdatasync(fd) != 0)
msg = _("could not fdatasync file \"%s\": %m");
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a027a8aabc..f8df7dbeae 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -355,11 +355,12 @@ static int fsync_parent_path(const char *fname, int elevel);
/*
- * pg_fsync --- do fsync with or without writethrough
+ * pg_fsync --- do fsync, if enabled, or redirect to Apple's F_FULLFSYNC.
*/
int
pg_fsync(int fd)
{
+ int rc;
#if !defined(WIN32) && defined(USE_ASSERT_CHECKING)
struct stat st;
@@ -398,30 +399,20 @@ pg_fsync(int fd)
errno = 0;
#endif
- /* #if is to skip the sync_method test if there's no need for it */
-#if defined(HAVE_FSYNC_WRITETHROUGH)
- if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
- return pg_fsync_writethrough(fd);
- else
-#endif
- return pg_fsync_no_writethrough(fd);
-}
-
-
-/*
- * pg_fsync_no_writethrough --- same as fsync except does nothing if
- * enableFsync is off
- */
-int
-pg_fsync_no_writethrough(int fd)
-{
- int rc;
-
- if (!enableFsync)
+ if (enableFsync == ENABLE_FSYNC_OFF)
return 0;
retry:
- rc = fsync(fd);
+#if defined(ENABLE_FSYNC_FULL)
+ if (enableFsync == ENABLE_FSYNC_FULL)
+ {
+ rc = fcntl(fd, F_FULLFSYNC);
+ }
+ else
+#endif
+ {
+ rc = fsync(fd);
+ }
if (rc == -1 && errno == EINTR)
goto retry;
@@ -430,37 +421,28 @@ retry:
}
/*
- * pg_fsync_writethrough
- */
-int
-pg_fsync_writethrough(int fd)
-{
- if (enableFsync)
- {
-#if defined(F_FULLFSYNC)
- return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
-#else
- errno = ENOSYS;
- return -1;
-#endif
- }
- else
- return 0;
-}
-
-/*
- * pg_fdatasync --- same as fdatasync except does nothing if enableFsync is off
+ * pg_fdatasync --- same as fdatasync except does nothing if enableFsync is off,
+ * and redirects to Apple's F_FULLSYNC if configured.
*/
int
pg_fdatasync(int fd)
{
int rc;
- if (!enableFsync)
+ if (enableFsync == ENABLE_FSYNC_OFF)
return 0;
retry:
- rc = fdatasync(fd);
+#if defined(ENABLE_FSYNC_FULL)
+ if (enableFsync == ENABLE_FSYNC_FULL)
+ {
+ rc = fcntl(fd, F_FULLFSYNC);
+ }
+ else
+#endif
+ {
+ rc = fdatasync(fd);
+ }
if (rc == -1 && errno == EINTR)
goto retry;
@@ -482,7 +464,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
* if fsyncs are disabled - that's a decision we might want to make
* configurable at some point.
*/
- if (!enableFsync)
+ if (enableFsync == ENABLE_FSYNC_OFF)
return;
/*
@@ -3491,7 +3473,7 @@ SyncDataDirectory(void)
bool xlog_is_symlink;
/* We can skip this whole thing if fsync is disabled. */
- if (!enableFsync)
+ if (enableFsync == ENABLE_FSYNC_OFF)
return;
/*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 04fcb06056..1f8d368e08 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -386,7 +386,7 @@ ProcessSyncRequests(void)
* all. (We delay checking until this point so that changing fsync on
* the fly behaves sensibly.)
*/
- if (enableFsync)
+ if (enableFsync != ENABLE_FSYNC_OFF)
{
/*
* If in checkpointer, we want to absorb pending requests every so
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 011ec18015..78d5ef6a63 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -120,7 +120,7 @@ int DateStyle = USE_ISO_DATES;
int DateOrder = DATEORDER_MDY;
int IntervalStyle = INTSTYLE_POSTGRES;
-bool enableFsync = true;
+int enableFsync = DEFAULT_ENABLE_FSYNC;
bool allowSystemTableMods = false;
int work_mem = 4096;
double hash_mem_multiplier = 2.0;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 93dc2e7680..df44f853d4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -477,6 +477,22 @@ static const struct config_enum_entry wal_compression_options[] = {
{NULL, 0, false}
};
+static const struct config_enum_entry enable_fsync_options[] = {
+ {"on", ENABLE_FSYNC_ON, false},
+ {"off", ENABLE_FSYNC_OFF, false},
+ {"true", ENABLE_FSYNC_ON, true},
+ {"false", ENABLE_FSYNC_OFF, true},
+ {"yes", ENABLE_FSYNC_ON, true},
+ {"no", ENABLE_FSYNC_OFF, true},
+ {"1", ENABLE_FSYNC_ON, true},
+ {"0", ENABLE_FSYNC_OFF, true},
+#ifdef ENABLE_FSYNC_FULL
+ {"full", ENABLE_FSYNC_FULL, false},
+#endif
+ {NULL, 0, false}
+};
+
+
/*
* Options for enum values stored in other modules
*/
@@ -1095,18 +1111,6 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
- {
- {"fsync", PGC_SIGHUP, WAL_SETTINGS,
- gettext_noop("Forces synchronization of updates to disk."),
- gettext_noop("The server will use the fsync() system call in several places to make "
- "sure that updates are physically written to disk. This insures "
- "that a database cluster will recover to a consistent state after "
- "an operating system or hardware crash.")
- },
- &enableFsync,
- true,
- NULL, NULL, NULL
- },
{
{"ignore_checksum_failure", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Continues processing after a checksum failure."),
@@ -4643,6 +4647,19 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"fsync", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Forces synchronization of updates to disk."),
+ gettext_noop("Whether to use syncronization APIs to make "
+ "sure that updates are physically written to disk. This insures "
+ "that a database cluster will recover to a consistent state after "
+ "an operating system or hardware crash.")
+ },
+ &enableFsync,
+ DEFAULT_ENABLE_FSYNC, enable_fsync_options,
+ NULL, NULL, NULL
+ },
+
{
{"transaction_isolation", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Sets the current transaction's isolation level."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d466143e4a..f2ec3b68dd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -215,7 +215,7 @@
# unrecoverable data corruption)
#synchronous_commit = on # synchronization level;
# off, local, remote_write, remote_apply, or on
-#wal_sync_method = fdatasync # fsync, fdatasync, fsync_writethrough,
+#wal_sync_method = fdatasync # fsync, fdatasync,
# open_sync, open_datasync
#full_page_writes = on # recover from partial page writes
#wal_log_hints = off # also do full page writes of non-critical updates
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 5b431d2f99..35bc604cf1 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -87,9 +87,6 @@ static DWORD WINAPI process_alarm(LPVOID param);
#endif
static void signal_cleanup(SIGNAL_ARGS);
-#ifdef HAVE_FSYNC_WRITETHROUGH
-static int pg_fsync_writethrough(int fd);
-#endif
static void print_elapse(struct timeval start_t, struct timeval stop_t, int ops);
#define die(msg) pg_fatal("%s: %m", _(msg))
@@ -292,7 +289,7 @@ test_sync(int writes_per_op)
printf(_("\nCompare file sync methods using one %dkB write:\n"), XLOG_BLCKSZ_K);
else
printf(_("\nCompare file sync methods using two %dkB writes:\n"), XLOG_BLCKSZ_K);
- printf(_("(fdatasync is the default)\n"));
+ printf(_("(fdatasync is the default, but see docs for macOS)\n"));
/*
* Test open_datasync if available
@@ -371,12 +368,13 @@ test_sync(int writes_per_op)
close(tmpfile);
/*
- * If fsync_writethrough is available, test as well
+ * Test the macOS fcntl that use instead of fsync/fdatasync levels if
+ * fsync=full.
*/
- printf(LABEL_FORMAT, "fsync_writethrough");
+ printf(LABEL_FORMAT, "fcntl(F_FULLFSYNC)");
fflush(stdout);
-#ifdef HAVE_FSYNC_WRITETHROUGH
+#ifdef F_FULLFSYNC
if ((tmpfile = open(filename, O_RDWR | PG_BINARY, 0)) == -1)
die("could not open output file");
START_TIMER;
@@ -388,8 +386,8 @@ test_sync(int writes_per_op)
XLOG_BLCKSZ,
writes * XLOG_BLCKSZ) != XLOG_BLCKSZ)
die("write failed");
- if (pg_fsync_writethrough(tmpfile) != 0)
- die("fsync failed");
+ if (fcntl(tmpfile, F_FULLFSYNC) != 0)
+ die("fcntl failed");
}
STOP_TIMER;
close(tmpfile);
@@ -504,8 +502,7 @@ test_file_descriptor_sync(void)
/*
* Test whether fsync can sync data written on a different descriptor for
* the same file. This checks the efficiency of multi-process fsyncs
- * against the same file. Possibly this should be done with writethrough
- * on platforms which support it.
+ * against the same file.
*/
printf(_("\nTest if fsync on non-write file descriptor is honored:\n"));
printf(_("(If the times are similar, fsync() can sync data written on a different\n"
@@ -600,20 +597,6 @@ signal_cleanup(SIGNAL_ARGS)
exit(1);
}
-#ifdef HAVE_FSYNC_WRITETHROUGH
-
-static int
-pg_fsync_writethrough(int fd)
-{
-#if defined(F_FULLFSYNC)
- return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
-#else
- errno = ENOSYS;
- return -1;
-#endif
-}
-#endif
-
/*
* print out the writes per second for tests
*/
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 48ca852381..0386c2c7a3 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -22,8 +22,7 @@
#define SYNC_METHOD_FSYNC 0
#define SYNC_METHOD_FDATASYNC 1
#define SYNC_METHOD_OPEN 2 /* for O_SYNC */
-#define SYNC_METHOD_FSYNC_WRITETHROUGH 3
-#define SYNC_METHOD_OPEN_DSYNC 4 /* for O_DSYNC */
+#define SYNC_METHOD_OPEN_DSYNC 3 /* for O_DSYNC */
extern PGDLLIMPORT int sync_method;
extern PGDLLIMPORT XLogRecPtr ProcLastRecPtr;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..26d2ad38b7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -23,6 +23,7 @@
#ifndef MISCADMIN_H
#define MISCADMIN_H
+#include <fcntl.h>
#include <signal.h>
#include "datatype/timestamp.h" /* for TimestampTz */
@@ -256,7 +257,19 @@ extern PGDLLIMPORT int IntervalStyle;
#define MAXTZLEN 10 /* max TZ name len, not counting tr. null */
-extern PGDLLIMPORT bool enableFsync;
+#define ENABLE_FSYNC_OFF 0
+#define ENABLE_FSYNC_ON 1
+#ifdef F_FULLFSYNC
+#define ENABLE_FSYNC_FULL 2
+#endif
+
+#ifdef ENABLE_FSYNC_FULL
+#define DEFAULT_ENABLE_FSYNC ENABLE_FSYNC_FULL
+#else
+#define DEFAULT_ENABLE_FSYNC ENABLE_FSYNC_ON
+#endif
+
+extern PGDLLIMPORT int enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT double hash_mem_multiplier;
diff --git a/src/include/port/darwin.h b/src/include/port/darwin.h
index 15fb69d6db..41e2a427d3 100644
--- a/src/include/port/darwin.h
+++ b/src/include/port/darwin.h
@@ -1,8 +1,3 @@
/* src/include/port/darwin.h */
#define __darwin__ 1
-
-#if HAVE_DECL_F_FULLFSYNC /* not present before macOS 10.3 */
-#define HAVE_FSYNC_WRITETHROUGH
-
-#endif
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 6791a406fc..ed49795f97 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -183,8 +183,6 @@ extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
-extern int pg_fsync_no_writethrough(int fd);
-extern int pg_fsync_writethrough(int fd);
extern int pg_fdatasync(int fd);
extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
extern int pg_truncate(const char *path, off_t length);
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 1cbc857e35..ee963e8351 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -232,7 +232,6 @@ sub GenerateFiles
HAVE_CRTDEFS_H => undef,
HAVE_CRYPTO_LOCK => undef,
HAVE_DECL_FDATASYNC => 0,
- HAVE_DECL_F_FULLFSYNC => 0,
HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER => 0,
HAVE_DECL_LLVMCREATEPERFJITEVENTLISTENER => 0,
HAVE_DECL_LLVMGETHOSTCPUNAME => 0,
--
2.39.2
2024-01 Commitfest.
Hi, this patch was marked in CF as "Needs Review" [1]https://commitfest.postgresql.org/46/4453/, but there has
been no activity on this thread for 6+ months.
Is anything else planned, or can you post something to elicit more
interest in the patch? Otherwise, if nothing happens then the CF entry
will be closed ("Returned with feedback") at the end of this CF.
======
[1]: https://commitfest.postgresql.org/46/4453/
Kind Regards,
Peter Smith.
On Mon, 22 Jan 2024 at 07:46, Peter Smith <smithpb2250@gmail.com> wrote:
2024-01 Commitfest.
Hi, this patch was marked in CF as "Needs Review" [1], but there has
been no activity on this thread for 6+ months.Is anything else planned, or can you post something to elicit more
interest in the patch? Otherwise, if nothing happens then the CF entry
will be closed ("Returned with feedback") at the end of this CF.
With no update to the thread and the patch not applying I'm marking
this as returned with feedback. Please feel free to resubmit to the
next CF when there is a new version of the patch.
Regards,
Vignesh
Rebased over 8d140c58.
Attachments:
v2-0001-Make-wal_sync_method-fdatasync-the-default-on-all.patchapplication/octet-stream; name=v2-0001-Make-wal_sync_method-fdatasync-the-default-on-all.patchDownload
From 98a35ca025776528014703cffc0c6e2ba06ba453 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 15 Jul 2023 11:41:52 +1200
Subject: [PATCH v2 1/2] Make wal_sync_method=fdatasync the default on all
OSes.
Previously, fdatasync was the default level on Linux and FreeBSD by
special rules, and on OpenBSD and DragonflyBSD because they didn't have
O_DSYNC != O_SYNC or O_DSYNC at all. For every other system, we'd
choose open_datasync.
Use fdatasync everywhere, for consistency. This became possible after
commit 9430fb40 added support for Windows, the last known system not to
have it. This means that we'll now flush caches on consumer drives by
default on Windows, where previously we didn't, which seems like a
better default for crash safety. Users who want the older behavior can
still request it with wal_sync_method=open_datasync.
It's entirely likely that we'll reintroduce per-platform choices in
future, but this commit reverses our previous assumption that
open_datasync is the best choice unless we hear otherwise.
Reviewed-by: Anton A. Melnikov <aamelnikov@inbox.ru>
Discussion: https://postgr.es/m/CA%2BhUKG%2BF0EL4Up6yVYbbcWse4xKaqW4wc2xpw67Pq9FjmByWVg%40mail.gmail.com
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
doc/src/sgml/config.sgml | 5 ++---
src/backend/utils/misc/postgresql.conf.sample | 9 ++-------
src/bin/pg_test_fsync/pg_test_fsync.c | 4 ++--
src/include/access/xlogdefs.h | 16 ++--------------
src/include/port/freebsd.h | 7 -------
src/include/port/linux.h | 8 --------
6 files changed, 8 insertions(+), 41 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 43b1a132a2..9235caa0c0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3206,9 +3206,8 @@ include_dir 'conf.d'
</itemizedlist>
<para>
Not all of these choices are available on all platforms.
- The default is the first method in the above list that is supported
- by the platform, except that <literal>fdatasync</literal> is the default on
- Linux and FreeBSD. The default is not necessarily ideal; it might be
+ The default is <literal>fdatasync</literal>.
+ The default is not necessarily ideal; it might be
necessary to change this setting or other aspects of your system
configuration in order to create a crash-safe configuration or
achieve optimal performance.
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index edcc0282b2..9b6765ceb5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -224,13 +224,8 @@
# unrecoverable data corruption)
#synchronous_commit = on # synchronization level;
# off, local, remote_write, remote_apply, or on
-#wal_sync_method = fsync # the default is the first option
- # supported by the operating system:
- # open_datasync
- # fdatasync (default on Linux and FreeBSD)
- # fsync
- # fsync_writethrough
- # open_sync
+#wal_sync_method = fdatasync # fsync, fdatasync, fsync_writethrough,
+ # open_sync, open_datasync
#full_page_writes = on # recover from partial page writes
#wal_log_hints = off # also do full page writes of non-critical updates
# (change requires restart)
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 5c0da425fb..04dcd77152 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -298,7 +298,7 @@ test_sync(int writes_per_op)
printf(_("\nCompare file sync methods using one %dkB write:\n"), XLOG_BLCKSZ_K);
else
printf(_("\nCompare file sync methods using two %dkB writes:\n"), XLOG_BLCKSZ_K);
- printf(_("(in wal_sync_method preference order, except fdatasync is Linux's default)\n"));
+ printf(_("(fdatasync is the default)\n"));
/*
* Test open_datasync if available
@@ -332,7 +332,7 @@ test_sync(int writes_per_op)
#endif
/*
- * Test fdatasync if available
+ * Test fdatasync
*/
printf(LABEL_FORMAT, "fdatasync");
fflush(stdout);
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 3009798952..0c0dce91c7 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -64,19 +64,7 @@ typedef uint32 TimeLineID;
*/
typedef uint16 RepOriginId;
-/*
- * This chunk of hackery attempts to determine which file sync methods
- * are available on the current platform, and to choose an appropriate
- * default method.
- *
- * Note that we define our own O_DSYNC on Windows, but not O_SYNC.
- */
-#if defined(PLATFORM_DEFAULT_WAL_SYNC_METHOD)
-#define DEFAULT_WAL_SYNC_METHOD PLATFORM_DEFAULT_WAL_SYNC_METHOD
-#elif defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
-#define DEFAULT_WAL_SYNC_METHOD WAL_SYNC_METHOD_OPEN_DSYNC
-#else
-#define DEFAULT_WAL_SYNC_METHOD WAL_SYNC_METHOD_FDATASYNC
-#endif
+/* Default synchronization method for WAL. */
+#define DEFAULT_WAL_SYNC_METHOD SYNC_METHOD_FDATASYNC
#endif /* XLOG_DEFS_H */
diff --git a/src/include/port/freebsd.h b/src/include/port/freebsd.h
index c604187acd..2e36d3da4f 100644
--- a/src/include/port/freebsd.h
+++ b/src/include/port/freebsd.h
@@ -1,8 +1 @@
/* src/include/port/freebsd.h */
-
-/*
- * Set the default wal_sync_method to fdatasync. xlogdefs.h's normal rules
- * would prefer open_datasync on FreeBSD 13+, but that is not a good choice on
- * many systems.
- */
-#define PLATFORM_DEFAULT_WAL_SYNC_METHOD WAL_SYNC_METHOD_FDATASYNC
diff --git a/src/include/port/linux.h b/src/include/port/linux.h
index 8101af2b93..acd867606c 100644
--- a/src/include/port/linux.h
+++ b/src/include/port/linux.h
@@ -12,11 +12,3 @@
* to have a kernel version test here.
*/
#define HAVE_LINUX_EIDRM_BUG
-
-/*
- * Set the default wal_sync_method to fdatasync. With recent Linux versions,
- * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
- * perform better and (b) causes outright failures on ext4 data=journal
- * filesystems, because those don't support O_DIRECT.
- */
-#define PLATFORM_DEFAULT_WAL_SYNC_METHOD WAL_SYNC_METHOD_FDATASYNC
--
2.39.3 (Apple Git-145)
v2-0002-Remove-fsync_writethrough-add-fsync-full-macOS-on.patchapplication/octet-stream; name=v2-0002-Remove-fsync_writethrough-add-fsync-full-macOS-on.patchDownload
From 4127d39582b350b0b29e6b2681a88a67bcd4c9f2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 14 Jul 2023 21:51:44 +1200
Subject: [PATCH v2 2/2] Remove fsync_writethrough, add fsync=full (macOS
only).
Previously, wal_sync_method=fsync_writethrough affected pg_fsync() of
WAL, control, data and other files, by causing pg_fsync() to use the
fcntl(F_FULLFSYNC) call that Apple recommends for databases.
That's a little confusing, because the GUC purports to affect only WAL
files, and it was also not used by default, exposing users to data loss
risk. New approach:
1. wal_sync_method=fsync_writethrough is removed. Instead, we will
make the existing fsync and fdatasync methods do what Apple recommends.
2. Introduce a new setting fsync=full. This refactoring reflects the
fact that all pg_fsync() calls are affected, not just those for WAL
files. According to expectations established by other modern platforms,
fcntl(F_FULLFSYNC) is the "true" fsync operation on this platform, and
fsync=full will select that. The traditional fsync() system call is
still available by setting fsync=on.
3. Use fcntl(F_FULLSYNC) for pg_fdatasync() too, if fsync=full. Apple
doesn't exactly implement fdatasync(). It's not documented, but exists
in some partially implemented form, without a declaration. As far as we
can tell, it behaves just like fsync(). We'll still use it if you ask
for it with fsync=on.
4. Use fsync=full by default on Macs, following Apple's recommendation
for applications such as databases. It is not available on non-Macs, to
avoid confusing people on other operating systems. Since
wal_sync_method's default was changed to fdatasync by an earlier commit,
the net effect is that Macs now use F_FULLSYNC for all file
synchronization (data, control, WAL, ...) by default.
While here, also get rid of configure probes for F_FULLFSYNC, and just
test for its existence with #ifdef.
Discussion: https://postgr.es/m/CA%2BhUKG%2BF0EL4Up6yVYbbcWse4xKaqW4wc2xpw67Pq9FjmByWVg%40mail.gmail.com
---
configure | 14 ----
configure.ac | 3 -
doc/src/sgml/config.sgml | 31 ++++++--
doc/src/sgml/monitoring.sgml | 8 +-
doc/src/sgml/wal.sgml | 14 +---
meson.build | 1 -
src/backend/access/transam/xlog.c | 18 +----
src/backend/storage/file/fd.c | 74 +++++++------------
src/backend/storage/sync/sync.c | 2 +-
src/backend/utils/init/globals.c | 2 +-
src/backend/utils/misc/guc_tables.c | 41 +++++++---
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/bin/pg_test_fsync/pg_test_fsync.c | 33 ++-------
src/include/access/xlog.h | 1 -
src/include/access/xlogdefs.h | 2 +-
src/include/miscadmin.h | 15 +++-
src/include/pg_config.h.in | 4 -
src/include/port/darwin.h | 5 --
src/include/storage/fd.h | 2 -
19 files changed, 118 insertions(+), 154 deletions(-)
diff --git a/configure b/configure
index 46859a4244..0aa7e4c86b 100755
--- a/configure
+++ b/configure
@@ -15804,20 +15804,6 @@ cat >>confdefs.h <<_ACEOF
_ACEOF
-# This is probably only present on macOS, but may as well check always
-ac_fn_c_check_decl "$LINENO" "F_FULLFSYNC" "ac_cv_have_decl_F_FULLFSYNC" "#include <fcntl.h>
-"
-if test "x$ac_cv_have_decl_F_FULLFSYNC" = xyes; then :
- ac_have_decl=1
-else
- ac_have_decl=0
-fi
-
-cat >>confdefs.h <<_ACEOF
-#define HAVE_DECL_F_FULLFSYNC $ac_have_decl
-_ACEOF
-
-
ac_fn_c_check_func "$LINENO" "explicit_bzero" "ac_cv_func_explicit_bzero"
if test "x$ac_cv_func_explicit_bzero" = xyes; then :
$as_echo "#define HAVE_EXPLICIT_BZERO 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index 88b75a7696..a3f0a0575f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1801,9 +1801,6 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
AC_CHECK_DECLS([preadv], [], [], [#include <sys/uio.h>])
AC_CHECK_DECLS([pwritev], [], [], [#include <sys/uio.h>])
-# This is probably only present on macOS, but may as well check always
-AC_CHECK_DECLS(F_FULLFSYNC, [], [], [#include <fcntl.h>])
-
AC_REPLACE_FUNCS(m4_normalize([
explicit_bzero
getopt
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9235caa0c0..b571cd69c7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2948,7 +2948,7 @@ include_dir 'conf.d'
</varlistentry>
<varlistentry id="guc-fsync" xreflabel="fsync">
- <term><varname>fsync</varname> (<type>boolean</type>)
+ <term><varname>fsync</varname> (<type>enum</type>)
<indexterm>
<primary><varname>fsync</varname> configuration parameter</primary>
</indexterm>
@@ -2963,6 +2963,22 @@ include_dir 'conf.d'
consistent state after an operating system or hardware crash.
</para>
+ <para>
+ On macOS only, the parameter can also be set to
+ <literal>full</literal>, and that is the default.
+ This value causes <productname>PostgreSQL</productname> to replace all
+ calls to <function>fsync()</function> with
+ <function>fcntl(fd, F_FULLFSYNC)</function>, as recommended by
+ Apple's documentation. It also does the same for calls to
+ <function>fdatasync()</function>, a function that exists but is not
+ documented by Apple.
+ Setting it to <literal>on</literal> causes macOS's regular
+ <function>fsync()</function> and <function>fdatasync()</function>
+ functions to be used as on other platforms. They are known to be
+ considerably faster, but users should consult the warnings in Apple's
+ manual page for <function>fsync</function> first.
+ </para>
+
<para>
While turning off <varname>fsync</varname> is often a performance
benefit, this can result in unrecoverable data corruption in
@@ -3194,11 +3210,6 @@ include_dir 'conf.d'
</para>
</listitem>
<listitem>
- <para>
- <literal>fsync_writethrough</literal> (call <function>fsync()</function> at each commit, forcing write-through of any disk write cache)
- </para>
- </listitem>
- <listitem>
<para>
<literal>open_sync</literal> (write WAL files with <function>open()</function> option <symbol>O_SYNC</symbol>)
</para>
@@ -3215,6 +3226,14 @@ include_dir 'conf.d'
This parameter can only be set in the <filename>postgresql.conf</filename>
file or on the server command line.
</para>
+ <para>
+ Note that on macOS, the <xref linkend="guc-fsync"/> setting can modify
+ the behavior of the <literal>fdatasync</literal> and
+ <literal>fsync</literal> levels, since
+ it can cause all calls to <function>fdatasync()</function> and
+ <function>fsync()</function> to be replaced with an Apple-specific
+ <function>fcntl()</function> call.
+ </para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4b8b38b70e..4116dc3c09 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3138,8 +3138,8 @@ description | Waiting for a newly initialized WAL file to reach durable storage
<function>issue_xlog_fsync</function> request
(if <xref linkend="guc-fsync"/> is <literal>on</literal> and
<xref linkend="guc-wal-sync-method"/> is either
- <literal>fdatasync</literal>, <literal>fsync</literal> or
- <literal>fsync_writethrough</literal>, otherwise zero).
+ <literal>fdatasync</literal> or <literal>fsync</literal>,
+ otherwise zero).
See <xref linkend="wal-configuration"/> for more information about
the internal WAL function <function>issue_xlog_fsync</function>.
</para></entry>
@@ -3169,8 +3169,8 @@ description | Waiting for a newly initialized WAL file to reach durable storage
(if <varname>track_wal_io_timing</varname> is enabled,
<varname>fsync</varname> is <literal>on</literal>, and
<varname>wal_sync_method</varname> is either
- <literal>fdatasync</literal>, <literal>fsync</literal> or
- <literal>fsync_writethrough</literal>, otherwise zero).
+ <literal>fdatasync</literal> or <literal>fsync</literal>,
+ otherwise zero).
</para></entry>
</row>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 05e2a8f8be..d73f0c1480 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -114,12 +114,6 @@
</para>
</listitem>
- <listitem>
- <para>
- On <productname>macOS</productname>, write caching can be prevented by
- setting <varname>wal_sync_method</varname> to <literal>fsync_writethrough</literal>.
- </para>
- </listitem>
</itemizedlist>
<para>
@@ -785,10 +779,6 @@
The <xref linkend="guc-wal-sync-method"/> parameter determines how
<productname>PostgreSQL</productname> will ask the kernel to force
<acronym>WAL</acronym> updates out to disk.
- All the options should be the same in terms of reliability, with
- the exception of <literal>fsync_writethrough</literal>, which can sometimes
- force a flush of the disk cache even when other options do not do so.
- However, it's quite platform-specific which one will be the fastest.
You can test the speeds of different options using the <xref
linkend="pgtestfsync"/> program.
Note that this parameter is irrelevant if <varname>fsync</varname>
@@ -822,8 +812,8 @@
<literal>open_datasync</literal> or <literal>open_sync</literal>,
a write operation in <function>XLogWrite</function> guarantees to sync written
WAL data to disk and <function>issue_xlog_fsync</function> does nothing.
- If <varname>wal_sync_method</varname> is either <literal>fdatasync</literal>,
- <literal>fsync</literal>, or <literal>fsync_writethrough</literal>,
+ If <varname>wal_sync_method</varname> is either <literal>fdatasync</literal>
+ or <literal>fsync</literal>,
the write operation moves WAL buffers to kernel cache and
<function>issue_xlog_fsync</function> syncs them to disk. Regardless
of the setting of <varname>track_wal_io_timing</varname>, the number
diff --git a/meson.build b/meson.build
index a198eca25d..84aa4ff8a1 100644
--- a/meson.build
+++ b/meson.build
@@ -2188,7 +2188,6 @@ endforeach
decl_checks = [
- ['F_FULLFSYNC', 'fcntl.h'],
['fdatasync', 'unistd.h'],
['posix_fadvise', 'fcntl.h'],
['strlcat', 'string.h'],
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eb02e3b6a6..d8190323b6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -174,9 +174,6 @@ static bool check_wal_consistency_checking_deferred = false;
*/
const struct config_enum_entry wal_sync_method_options[] = {
{"fsync", WAL_SYNC_METHOD_FSYNC, false},
-#ifdef HAVE_FSYNC_WRITETHROUGH
- {"fsync_writethrough", WAL_SYNC_METHOD_FSYNC_WRITETHROUGH, false},
-#endif
{"fdatasync", WAL_SYNC_METHOD_FDATASYNC, false},
#ifdef O_SYNC
{"open_sync", WAL_SYNC_METHOD_OPEN, false},
@@ -2832,7 +2829,7 @@ XLogFlush(XLogRecPtr record)
* We do not sleep if enableFsync is not turned on, nor if there are
* fewer than CommitSiblings other backends with active transactions.
*/
- if (CommitDelay > 0 && enableFsync &&
+ if (CommitDelay > 0 && enableFsync != ENABLE_FSYNC_OFF &&
MinimumActiveBackends(CommitSiblings))
{
pg_usleep(CommitDelay);
@@ -8435,7 +8432,7 @@ get_sync_bit(int method)
o_direct_flag = PG_O_DIRECT;
/* If fsync is disabled, never open in sync mode */
- if (!enableFsync)
+ if (enableFsync == ENABLE_FSYNC_OFF)
return o_direct_flag;
switch (method)
@@ -8447,7 +8444,6 @@ get_sync_bit(int method)
* be seen here.
*/
case WAL_SYNC_METHOD_FSYNC:
- case WAL_SYNC_METHOD_FSYNC_WRITETHROUGH:
case WAL_SYNC_METHOD_FDATASYNC:
return o_direct_flag;
#ifdef O_SYNC
@@ -8522,7 +8518,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
* Quick exit if fsync is disabled or write() has already synced the WAL
* file.
*/
- if (!enableFsync ||
+ if (enableFsync == ENABLE_FSYNC_OFF ||
wal_sync_method == WAL_SYNC_METHOD_OPEN ||
wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
return;
@@ -8537,15 +8533,9 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
switch (wal_sync_method)
{
case WAL_SYNC_METHOD_FSYNC:
- if (pg_fsync_no_writethrough(fd) != 0)
+ if (pg_fsync(fd) != 0)
msg = _("could not fsync file \"%s\": %m");
break;
-#ifdef HAVE_FSYNC_WRITETHROUGH
- case WAL_SYNC_METHOD_FSYNC_WRITETHROUGH:
- if (pg_fsync_writethrough(fd) != 0)
- msg = _("could not fsync write-through file \"%s\": %m");
- break;
-#endif
case WAL_SYNC_METHOD_FDATASYNC:
if (pg_fdatasync(fd) != 0)
msg = _("could not fdatasync file \"%s\": %m");
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index d298e4842c..1f4fb70ab6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -380,11 +380,12 @@ ResourceOwnerForgetFile(ResourceOwner owner, File file)
}
/*
- * pg_fsync --- do fsync with or without writethrough
+ * pg_fsync --- do fsync, if enabled, or redirect to Apple's F_FULLFSYNC.
*/
int
pg_fsync(int fd)
{
+ int rc;
#if !defined(WIN32) && defined(USE_ASSERT_CHECKING)
struct stat st;
@@ -423,30 +424,20 @@ pg_fsync(int fd)
errno = 0;
#endif
- /* #if is to skip the wal_sync_method test if there's no need for it */
-#if defined(HAVE_FSYNC_WRITETHROUGH)
- if (wal_sync_method == WAL_SYNC_METHOD_FSYNC_WRITETHROUGH)
- return pg_fsync_writethrough(fd);
- else
-#endif
- return pg_fsync_no_writethrough(fd);
-}
-
-
-/*
- * pg_fsync_no_writethrough --- same as fsync except does nothing if
- * enableFsync is off
- */
-int
-pg_fsync_no_writethrough(int fd)
-{
- int rc;
-
- if (!enableFsync)
+ if (enableFsync == ENABLE_FSYNC_OFF)
return 0;
retry:
- rc = fsync(fd);
+#if defined(ENABLE_FSYNC_FULL)
+ if (enableFsync == ENABLE_FSYNC_FULL)
+ {
+ rc = fcntl(fd, F_FULLFSYNC);
+ }
+ else
+#endif
+ {
+ rc = fsync(fd);
+ }
if (rc == -1 && errno == EINTR)
goto retry;
@@ -455,37 +446,28 @@ retry:
}
/*
- * pg_fsync_writethrough
- */
-int
-pg_fsync_writethrough(int fd)
-{
- if (enableFsync)
- {
-#if defined(F_FULLFSYNC)
- return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
-#else
- errno = ENOSYS;
- return -1;
-#endif
- }
- else
- return 0;
-}
-
-/*
- * pg_fdatasync --- same as fdatasync except does nothing if enableFsync is off
+ * pg_fdatasync --- same as fdatasync except does nothing if enableFsync is off,
+ * and redirects to Apple's F_FULLFSYNC if configured.
*/
int
pg_fdatasync(int fd)
{
int rc;
- if (!enableFsync)
+ if (enableFsync == ENABLE_FSYNC_OFF)
return 0;
retry:
- rc = fdatasync(fd);
+#if defined(ENABLE_FSYNC_FULL)
+ if (enableFsync == ENABLE_FSYNC_FULL)
+ {
+ rc = fcntl(fd, F_FULLFSYNC);
+ }
+ else
+#endif
+ {
+ rc = fdatasync(fd);
+ }
if (rc == -1 && errno == EINTR)
goto retry;
@@ -530,7 +512,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
* if fsyncs are disabled - that's a decision we might want to make
* configurable at some point.
*/
- if (!enableFsync)
+ if (enableFsync == ENABLE_FSYNC_OFF)
return;
/*
@@ -3546,7 +3528,7 @@ SyncDataDirectory(void)
bool xlog_is_symlink;
/* We can skip this whole thing if fsync is disabled. */
- if (!enableFsync)
+ if (enableFsync == ENABLE_FSYNC_OFF)
return;
/*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 581faf5f29..224a01e7aa 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -386,7 +386,7 @@ ProcessSyncRequests(void)
* all. (We delay checking until this point so that changing fsync on
* the fly behaves sensibly.)
*/
- if (enableFsync)
+ if (enableFsync != ENABLE_FSYNC_OFF)
{
/*
* If in checkpointer, we want to absorb pending requests every so
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 5eaee88d96..8c1c24004b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -123,7 +123,7 @@ int DateStyle = USE_ISO_DATES;
int DateOrder = DATEORDER_MDY;
int IntervalStyle = INTSTYLE_POSTGRES;
-bool enableFsync = true;
+int enableFsync = DEFAULT_ENABLE_FSYNC;
bool allowSystemTableMods = false;
int work_mem = 4096;
double hash_mem_multiplier = 2.0;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 93ded31ed9..06a4414e14 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -482,6 +482,22 @@ static const struct config_enum_entry wal_compression_options[] = {
{NULL, 0, false}
};
+static const struct config_enum_entry enable_fsync_options[] = {
+ {"on", ENABLE_FSYNC_ON, false},
+ {"off", ENABLE_FSYNC_OFF, false},
+ {"true", ENABLE_FSYNC_ON, true},
+ {"false", ENABLE_FSYNC_OFF, true},
+ {"yes", ENABLE_FSYNC_ON, true},
+ {"no", ENABLE_FSYNC_OFF, true},
+ {"1", ENABLE_FSYNC_ON, true},
+ {"0", ENABLE_FSYNC_OFF, true},
+#ifdef ENABLE_FSYNC_FULL
+ {"full", ENABLE_FSYNC_FULL, false},
+#endif
+ {NULL, 0, false}
+};
+
+
/*
* Options for enum values stored in other modules
*/
@@ -1085,18 +1101,6 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
- {
- {"fsync", PGC_SIGHUP, WAL_SETTINGS,
- gettext_noop("Forces synchronization of updates to disk."),
- gettext_noop("The server will use the fsync() system call in several places to make "
- "sure that updates are physically written to disk. This ensures "
- "that a database cluster will recover to a consistent state after "
- "an operating system or hardware crash.")
- },
- &enableFsync,
- true,
- NULL, NULL, NULL
- },
{
{"ignore_checksum_failure", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Continues processing after a checksum failure."),
@@ -4753,6 +4757,19 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"fsync", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Forces synchronization of updates to disk."),
+ gettext_noop("Whether to use syncronization APIs to make "
+ "sure that updates are physically written to disk. This insures "
+ "that a database cluster will recover to a consistent state after "
+ "an operating system or hardware crash.")
+ },
+ &enableFsync,
+ DEFAULT_ENABLE_FSYNC, enable_fsync_options,
+ NULL, NULL, NULL
+ },
+
{
{"transaction_isolation", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Sets the current transaction's isolation level."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9b6765ceb5..ff65ae8c46 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -224,7 +224,7 @@
# unrecoverable data corruption)
#synchronous_commit = on # synchronization level;
# off, local, remote_write, remote_apply, or on
-#wal_sync_method = fdatasync # fsync, fdatasync, fsync_writethrough,
+#wal_sync_method = fdatasync # fsync, fdatasync,
# open_sync, open_datasync
#full_page_writes = on # recover from partial page writes
#wal_log_hints = off # also do full page writes of non-critical updates
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 04dcd77152..41cd537a84 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -93,9 +93,6 @@ static DWORD WINAPI process_alarm(LPVOID param);
#endif
static void signal_cleanup(SIGNAL_ARGS);
-#ifdef HAVE_FSYNC_WRITETHROUGH
-static int pg_fsync_writethrough(int fd);
-#endif
static void print_elapse(struct timeval start_t, struct timeval stop_t, int ops);
#define die(msg) pg_fatal("%s: %m", _(msg))
@@ -298,7 +295,7 @@ test_sync(int writes_per_op)
printf(_("\nCompare file sync methods using one %dkB write:\n"), XLOG_BLCKSZ_K);
else
printf(_("\nCompare file sync methods using two %dkB writes:\n"), XLOG_BLCKSZ_K);
- printf(_("(fdatasync is the default)\n"));
+ printf(_("(fdatasync is the default, but see docs for macOS)\n"));
/*
* Test open_datasync if available
@@ -377,12 +374,13 @@ test_sync(int writes_per_op)
close(tmpfile);
/*
- * If fsync_writethrough is available, test as well
+ * Test the macOS fcntl that use instead of fsync/fdatasync levels if
+ * fsync=full.
*/
- printf(LABEL_FORMAT, "fsync_writethrough");
+ printf(LABEL_FORMAT, "fcntl(F_FULLFSYNC)");
fflush(stdout);
-#ifdef HAVE_FSYNC_WRITETHROUGH
+#ifdef F_FULLFSYNC
if ((tmpfile = open(filename, O_RDWR | PG_BINARY, 0)) == -1)
die("could not open output file");
START_TIMER;
@@ -394,8 +392,8 @@ test_sync(int writes_per_op)
XLOG_BLCKSZ,
writes * XLOG_BLCKSZ) != XLOG_BLCKSZ)
die("write failed");
- if (pg_fsync_writethrough(tmpfile) != 0)
- die("fsync failed");
+ if (fcntl(tmpfile, F_FULLFSYNC) != 0)
+ die("fcntl failed");
}
STOP_TIMER;
close(tmpfile);
@@ -510,8 +508,7 @@ test_file_descriptor_sync(void)
/*
* Test whether fsync can sync data written on a different descriptor for
* the same file. This checks the efficiency of multi-process fsyncs
- * against the same file. Possibly this should be done with writethrough
- * on platforms which support it.
+ * against the same file.
*/
printf(_("\nTest if fsync on non-write file descriptor is honored:\n"));
printf(_("(If the times are similar, fsync() can sync data written on a different\n"
@@ -609,20 +606,6 @@ signal_cleanup(SIGNAL_ARGS)
_exit(1);
}
-#ifdef HAVE_FSYNC_WRITETHROUGH
-
-static int
-pg_fsync_writethrough(int fd)
-{
-#if defined(F_FULLFSYNC)
- return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
-#else
- errno = ENOSYS;
- return -1;
-#endif
-}
-#endif
-
/*
* print out the writes per second for tests
*/
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 76787a8267..4028738eca 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -24,7 +24,6 @@ typedef enum WalSyncMethod
WAL_SYNC_METHOD_FSYNC = 0,
WAL_SYNC_METHOD_FDATASYNC,
WAL_SYNC_METHOD_OPEN, /* for O_SYNC */
- WAL_SYNC_METHOD_FSYNC_WRITETHROUGH,
WAL_SYNC_METHOD_OPEN_DSYNC /* for O_DSYNC */
} WalSyncMethod;
extern PGDLLIMPORT int wal_sync_method;
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 0c0dce91c7..27c04a3053 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -65,6 +65,6 @@ typedef uint32 TimeLineID;
typedef uint16 RepOriginId;
/* Default synchronization method for WAL. */
-#define DEFAULT_WAL_SYNC_METHOD SYNC_METHOD_FDATASYNC
+#define DEFAULT_WAL_SYNC_METHOD WAL_SYNC_METHOD_FDATASYNC
#endif /* XLOG_DEFS_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 756d144c32..0563d18b09 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -23,6 +23,7 @@
#ifndef MISCADMIN_H
#define MISCADMIN_H
+#include <fcntl.h>
#include <signal.h>
#include "datatype/timestamp.h" /* for TimestampTz */
@@ -267,7 +268,19 @@ extern PGDLLIMPORT int IntervalStyle;
#define MAXTZLEN 10 /* max TZ name len, not counting tr. null */
-extern PGDLLIMPORT bool enableFsync;
+#define ENABLE_FSYNC_OFF 0
+#define ENABLE_FSYNC_ON 1
+#ifdef F_FULLFSYNC
+#define ENABLE_FSYNC_FULL 2
+#endif
+
+#ifdef ENABLE_FSYNC_FULL
+#define DEFAULT_ENABLE_FSYNC ENABLE_FSYNC_FULL
+#else
+#define DEFAULT_ENABLE_FSYNC ENABLE_FSYNC_ON
+#endif
+
+extern PGDLLIMPORT int enableFsync;
extern PGDLLIMPORT bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
extern PGDLLIMPORT double hash_mem_multiplier;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07e73567dc..1d4925bfc4 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -88,10 +88,6 @@
don't. */
#undef HAVE_DECL_FDATASYNC
-/* Define to 1 if you have the declaration of `F_FULLFSYNC', and to 0 if you
- don't. */
-#undef HAVE_DECL_F_FULLFSYNC
-
/* Define to 1 if you have the declaration of
`LLVMCreateGDBRegistrationListener', and to 0 if you don't. */
#undef HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER
diff --git a/src/include/port/darwin.h b/src/include/port/darwin.h
index 15fb69d6db..41e2a427d3 100644
--- a/src/include/port/darwin.h
+++ b/src/include/port/darwin.h
@@ -1,8 +1,3 @@
/* src/include/port/darwin.h */
#define __darwin__ 1
-
-#if HAVE_DECL_F_FULLFSYNC /* not present before macOS 10.3 */
-#define HAVE_FSYNC_WRITETHROUGH
-
-#endif
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 60bba5c970..ecbefce3d2 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -179,8 +179,6 @@ extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
extern bool looks_like_temp_rel_name(const char *name);
extern int pg_fsync(int fd);
-extern int pg_fsync_no_writethrough(int fd);
-extern int pg_fsync_writethrough(int fd);
extern int pg_fdatasync(int fd);
extern bool pg_file_exists(const char *fname);
extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
--
2.39.3 (Apple Git-145)
Short sales pitch for these patches:
* the default settings eat data on Macs and Windows
* nobody understands what wal_sync_method=fsync_writethrough means anyway
* it's a weird kludge that it affects not only WAL, let's clean that up
On Thu, Mar 14, 2024 at 01:12:05PM +1300, Thomas Munro wrote:
Short sales pitch for these patches:
* the default settings eat data on Macs and Windows
* nobody understands what wal_sync_method=fsync_writethrough means anyway
* it's a weird kludge that it affects not only WAL, let's clean that up
I recently started using macOS for hacking on Postgres and noticed this
problem, so I was delighted to find this thread. I intend to review
further soon, but +1 for improving the default settings. I think we might
also need some additional fcntl(F_FULLFSYNC) calls in sync_pgdata(),
sync_dir_recurse(), etc., which are used by initdb, pg_basebackup, and
more.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, 18 Jul 2023 at 05:29, Thomas Munro <thomas.munro@gmail.com> wrote:
It's possible that fcntl(F_FULLFSYNC) might fail with ENOSUPP or other
errors in obscure cases (eg unusual file systems). In that case, you
could manually lower fsync to just "on" and do your own research on
whether power loss can toast your database, but that doesn't seem like
a reason for us not to ship good solid defaults for typical users.
Is this the only reason why you're suggesting adding fsync=full,
instead of simply always setting F_FULLFSYNC when fsync=true on MacOS.
If so, I'm not sure we really gain anything by this tri-state. I think
people either care about data loss on power loss, or they don't. I
doubt many people want his third intermediate option, which afaict
basically means lose data on powerloss less often than fsync=false but
still lose data most of the time.
If you're going to keep this tri-state for MacOS, then it still seems
nicer to me to "fix" fsync=true on MacOS and introduce a fsync=partial
or something. Then defaults are the same across platforms and anyone
setting fsync=yes currently in their postgresql.conf would get the
fixed behaviour on upgrade.
On 25.05.24 04:01, Jelte Fennema-Nio wrote:
Is this the only reason why you're suggesting adding fsync=full,
instead of simply always setting F_FULLFSYNC when fsync=true on MacOS.
If so, I'm not sure we really gain anything by this tri-state. I think
people either care about data loss on power loss, or they don't. I
doubt many people want his third intermediate option, which afaict
basically means lose data on powerloss less often than fsync=false but
still lose data most of the time.
I agree, two states should be enough. It could basically just be
pg_fsync(int fd)
{
#if macos
fcntl(fd, F_FULLFSYNC);
#else
fsync(fd);
#endif
}
On Wed, May 29, 2024 at 06:49:57AM -0700, Peter Eisentraut wrote:
On 25.05.24 04:01, Jelte Fennema-Nio wrote:
Is this the only reason why you're suggesting adding fsync=full,
instead of simply always setting F_FULLFSYNC when fsync=true on MacOS.
If so, I'm not sure we really gain anything by this tri-state. I think
people either care about data loss on power loss, or they don't. I
doubt many people want his third intermediate option, which afaict
basically means lose data on powerloss less often than fsync=false but
still lose data most of the time.I agree, two states should be enough. It could basically just be
pg_fsync(int fd)
{
#if macos
fcntl(fd, F_FULLFSYNC);
#else
fsync(fd);
#endif
}
IIUC with this approach, anyone who is using a file system that fails
fcntl(F_FULLSYNC) with ENOSUPP would have to turn fsync off. That might be
the right thing to do since having a third option that sends the data to
the disk cache but doesn't provide any real guarantees if you lose power
may not be worth much. However, if such a file system _did_ provide such
guarantees with just fsync(), then it would be unfortunate to force people
to turn fsync off. But this could very well all be hypothetical, for all I
know... In any case, I agree that we should probably use F_FULLFSYNC by
default on macOS.
--
nathan
On 03.06.24 17:28, Nathan Bossart wrote:
I agree, two states should be enough. It could basically just be
pg_fsync(int fd)
{
#if macos
fcntl(fd, F_FULLFSYNC);
#else
fsync(fd);
#endif
}IIUC with this approach, anyone who is using a file system that fails
fcntl(F_FULLSYNC) with ENOSUPP would have to turn fsync off. That might be
the right thing to do since having a third option that sends the data to
the disk cache but doesn't provide any real guarantees if you lose power
may not be worth much. However, if such a file system_did_ provide such
guarantees with just fsync(), then it would be unfortunate to force people
to turn fsync off. But this could very well all be hypothetical, for all I
know... In any case, I agree that we should probably use F_FULLFSYNC by
default on macOS.
Yeah, my example code above says "#if macos", not "#ifdef F_FULLSYNC".
The latter might be a problem along the lines you describe if other
systems use that symbol in a slightly different manner.