Fwd: Is the fsync() fake on FreeBSD6.1?

Started by Jim Nasby over 19 years ago, 8 messages
#1Jim Nasby
jim@nasby.net

I thought folks might be interested in this... note in particular the
comment about linux.

Begin forwarded message:

From: Greg 'groggy' Lehey <grog@FreeBSD.org>
Date: June 26, 2006 11:34:12 PM EDT
To: leo huang <leo.huang.list@gmail.com>
Cc: freebsd-performance@freebsd.org
Subject: Re: Is the fsync() fake on FreeBSD6.1?

On Tuesday, 27 June 2006 at 10:18:47 +0800, leo huang wrote:

Hi,

I benchmarked MySQL 4.1.18 on FreeBSD 6.1 and Debian 3.1 using
Super Smack 1.3 some days ago.

...

The result surprised me. MySQL performance on FreeBSD 6.1 is about
10 times that on Debian 3.1, and the output of iostat also shows it.

I know that MySQL uses fsync() to flush both the data and log files
by default when using the InnoDB engine
(http://dev.mysql.com/doc/refman/4.1/en/innodb-parameters.html). Our
evaluating computer only has a 10000RPM SCSI hard disk. I think it
can do about 200 sequential fsync() calls per second if the fsync()
is real.
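
As a rough check of that estimate, here is a back-of-the-envelope sketch. It assumes that each synchronous commit to a sequentially written log has to wait for about one platter rotation, and it ignores seek time and controller overhead; the function name is made up for illustration.

```python
# Back-of-the-envelope ceiling on the fsync() rate of one spinning disk.
# Assumption: each fsync() of a sequentially written log waits for
# roughly one full platter rotation before the next write can land.

def max_fsyncs_per_second(rpm: float) -> float:
    """Rotations per second, used as a rough upper bound on commits/sec."""
    return rpm / 60.0


if __name__ == "__main__":
    for rpm in (7200, 10000, 15000):
        print(f"{rpm} RPM -> about {max_fsyncs_per_second(rpm):.0f} fsync()/s")
```

For a 10000 RPM drive this gives roughly 167 per second, the same order of magnitude as the 200 quoted above, so a measured rate ten times higher would be hard to explain if every fsync() really reached the platter.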

Is the fsync() on FreeBSD6.1 fake?

My understanding from the last time I looked at the code was that
fsync does the right thing:

The fsync() system call causes all modified data and attributes of fd
to be moved to a permanent storage device. This normally results in
all in-core modified copies of buffers for the associated file to be
written to a disk.

This is not the case for Linux, where fsync syncs the entire file
system. That could explain some of the performance difference, but
not all of it. I suppose it's worth noting that, in general, people
report much better performance with MySQL on Linux than on FreeBSD.

I mean that the data is only written to the drive's memory and so can
be lost if power goes down.

I don't believe that fsync is required to flush the drive buffers. It
would be nice to have a function that did, though.

And how can I confirm this?

Trial and error?

Greg
--
See complete headers for address and phone numbers.

--
Jim Nasby jimn@enterprisedb.com
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#2Noname
mark@mark.mielke.cc
In reply to: Jim Nasby (#1)
Re: Fwd: Is the fsync() fake on FreeBSD6.1?

On Fri, Sep 22, 2006 at 01:52:02PM -0400, Jim Nasby wrote:

I thought folks might be interested in this... note in particular the
comment about linux.

...

From: Greg 'groggy' Lehey <grog@FreeBSD.org>
Date: June 26, 2006 11:34:12 PM EDT
To: leo huang <leo.huang.list@gmail.com>
Cc: freebsd-performance@freebsd.org
Subject: Re: Is the fsync() fake on FreeBSD6.1?
...
My understanding from the last time I looked at the code was that
fsync does the right thing:

The fsync() system call causes all modified data and attributes of fd
to be moved to a permanent storage device. This normally results in
all in-core modified copies of buffers for the associated file to be
written to a disk.

This is not the case for Linux, where fsync syncs the entire file
system. That could explain some of the performance difference, but
not all of it. I suppose it's worth noting that, in general, people
report much better performance with MySQL on Linux than on FreeBSD.

I see Greg's comment as contradictory. People see better performance with
MySQL on Linux than on FreeBSD, yet fsync() on Linux syncs the whole file
system?

I don't believe that fsync() on Linux syncs the whole file system
either. This sounds made up, or a confusion with 'sync'. Perhaps
people @FreeBSD.org are not as familiar with Linux.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#2)
Re: Fwd: Is the fsync() fake on FreeBSD6.1?

mark@mark.mielke.cc writes:

I don't believe that fsync() on Linux syncs the whole file system
either.

Indeed. I'd disregard this as coming from someone who knows much
less than he thinks.

(The most likely explanation for his results, I expect, is that FreeBSD
is trying to fsync and the disk drive is lying to it, whereas on his
comparison Linux machine the drive is not configured to lie about
write-complete.)

regards, tom lane

#4AgentM
agentm@themactionfaction.com
In reply to: Noname (#2)
Re: Fwd: Is the fsync() fake on FreeBSD6.1?

On Sep 22, 2006, at 15:00 , mark@mark.mielke.cc wrote:

On Fri, Sep 22, 2006 at 01:52:02PM -0400, Jim Nasby wrote:

I thought folks might be interested in this... note in particular the
comment about linux.

...

From: Greg 'groggy' Lehey <grog@FreeBSD.org>
Date: June 26, 2006 11:34:12 PM EDT
To: leo huang <leo.huang.list@gmail.com>
Cc: freebsd-performance@freebsd.org
Subject: Re: Is the fsync() fake on FreeBSD6.1?
...
My understanding from the last time I looked at the code was that
fsync does the right thing:

The fsync() system call causes all modified data and attributes of fd
to be moved to a permanent storage device. This normally results in
all in-core modified copies of buffers for the associated file to be
written to a disk.

This is probably the same issue that the hackers encountered on
Darwin- namely fsync() flushes the kernel cache, but a further
function call was needed to flush the hard drive buffers. This meets
the standard's definition of fsync because the data is indeed moved
to the device, but it happens to just be the device's buffer instead
of non-volatile storage.

-M
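
For illustration, on Darwin the further call in question is fcntl() with F_FULLFSYNC. A minimal sketch of a helper that uses it where the platform defines it and falls back to plain fsync() elsewhere might look like this. The function name and the returned labels are hypothetical, and whether the drive honours the flush request still depends on the hardware.

```python
import fcntl
import os
import tempfile


def flush_to_stable_storage(fd: int) -> str:
    """Flush fd as far down the stack as the platform allows.

    Returns a label naming the mechanism that was used.
    """
    # F_FULLFSYNC is only defined on Darwin/macOS builds of Python.
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return "F_FULLFSYNC"
        except OSError:
            pass  # not supported on this filesystem; fall back to fsync()
    os.fsync(fd)
    return "fsync"


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"commit record\n")
        f.flush()  # push Python's userspace buffer to the kernel first
        print(flush_to_stable_storage(f.fileno()))
```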

#5Andrew - Supernews
andrew+nonews@supernews.com
In reply to: Jim Nasby (#1)
Re: Fwd: Is the fsync() fake on FreeBSD6.1?

On 2006-09-22, Jim Nasby <jim@nasby.net> wrote:

I thought folks might be interested in this... note in particular the
comment about linux.

I don't believe that either person in that discussion knows what they are
really talking about.

fsync() on FreeBSD does, as is required, force any modified data for the
file, plus any metadata, plus any modifications to any parent directories,
to the underlying disk device and waits for that device to report the
write as complete.

Whether the underlying device lies about the write completion is another
matter. All current SCSI disks have WCE enabled by default, which means
that they will lie about write completion if FUA was not set in the
request, which FreeBSD never sets. (It's not possible to get correct
results by having fsync() somehow selectively set FUA, because that would
leave previously-completed requests in the cache.)
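
The parenthetical above can be illustrated with a toy model. This is only a sketch: the class and method names are hypothetical, and it models nothing more than "a cached write is acknowledged but volatile; a FUA write or a cache flush reaches the platter".

```python
class ToyDrive:
    """Hypothetical model of a disk with a volatile write cache."""

    def __init__(self, wce: bool):
        self.wce = wce
        self.cache = {}    # block -> data, lost on power failure
        self.platter = {}  # block -> data, survives power failure

    def write(self, block: int, data: str, fua: bool = False) -> None:
        """Acknowledge a write; it is durable only if it reached the platter."""
        if self.wce and not fua:
            self.cache[block] = data     # acknowledged, but still volatile
        else:
            self.cache.pop(block, None)  # drop any stale cached copy
            self.platter[block] = data

    def synchronize_cache(self) -> None:
        """Model of a cache flush: push every cached block to the platter."""
        self.platter.update(self.cache)
        self.cache.clear()

    def power_loss(self) -> dict:
        """Discard the cache and return what survived."""
        self.cache.clear()
        return dict(self.platter)


def scenario(wce: bool, flush: bool) -> dict:
    drive = ToyDrive(wce=wce)
    drive.write(1, "data page")                # earlier, ordinary write
    drive.write(2, "commit record", fua=True)  # FUA set only on the final write
    if flush:
        drive.synchronize_cache()
    return drive.power_loss()


if __name__ == "__main__":
    print("WCE on, FUA on last write only:", scenario(wce=True, flush=False))
    print("WCE on, cache flushed:         ", scenario(wce=True, flush=True))
    print("WCE off:                       ", scenario(wce=False, flush=False))
```

In the first scenario the commit record survives but the earlier data page does not, which is the failure described above; flushing the whole cache or disabling WCE keeps both.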

WCE can be disabled on either a temporary or permanent basis by changing
the appropriate modepage. It's possible that Linux does this automatically,
or sets FUA on all writes, though that would surprise me considerably;
however I disclaim any knowledge of Linux internals.

On FreeBSD, this command will disable WCE permanently on a SCSI drive:

echo 'WCE: 0' | camcontrol modepage daXX -m 8 -P3 -e

(use -P0 to disable it only temporarily, or you can use just the second of
those commands alone to interactively edit the mode page)

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew - Supernews (#5)
Re: Fwd: Is the fsync() fake on FreeBSD6.1?

Andrew - Supernews <andrew+nonews@supernews.com> writes:

Whether the underlying device lies about the write completion is another
matter. All current SCSI disks have WCE enabled by default, which means
that they will lie about write completion if FUA was not set in the
request, which FreeBSD never sets.

Huh? The entire point of the SCSI command set is that it's not
necessary to lie about write completion for performance reasons, because
the architecture has always supported the concept of multiple requests
in-flight concurrently. Has the disk drive industry gotten a whole lot
stupider in the fifteen years since I last wrote a SCSI driver?

regards, tom lane

#7Andrew - Supernews
andrew+nonews@supernews.com
In reply to: Jim Nasby (#1)
Re: Fwd: Is the fsync() fake on FreeBSD6.1?

On 2006-09-23, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew - Supernews <andrew+nonews@supernews.com> writes:

Whether the underlying device lies about the write completion is another
matter. All current SCSI disks have WCE enabled by default, which means
that they will lie about write completion if FUA was not set in the
request, which FreeBSD never sets.

Huh? The entire point of the SCSI command set is that it's not
necessary to lie about write completion for performance reasons, because
the architecture has always supported the concept of multiple requests
in-flight concurrently.

I seem to recall we've had this conversation previously.

Has the disk drive industry gotten a whole lot
stupider in the fifteen years since I last wrote a SCSI driver?

Quite possibly, yes.

I certainly would never claim that WCE is a good idea, or that having it
enabled by default is a good idea, I merely report the _fact_ that it is
indeed enabled by default on every SCSI drive that I have recently
encountered (over several different vendors).

On my database machines I am careful to disable it (and check that this
does indeed take effect). I would recommend that others do likewise. The
performance impact of disabling WCE is not serious (other than removing
the unsafe speed gains of course).

Since posting the previous response I've been directed to a document that
seems to imply that Linux drivers now attempt to handle write-order
guarantees by introducing the concept of a "write barrier", i.e. a write
request which must complete after all previous writes and before all
subsequent ones. Achieving this requires different strategies depending
on whether the underlying device allows command-queueing and/or exposes a
useful cache flush command; the implication of this is that for SCSI disks
with WCE, the Linux driver will actually send SYNCHRONIZE CACHE when doing
a write barrier (which could be expensive of course). If (and I have no
idea if this is true) fsync() is implemented by means of such a barrier,
then this implies that an fsync()-heavy workload will perform much worse
on Linux when WCE is enabled than when it is disabled, since in the latter
case the driver will not issue SYNCHRONIZE CACHE and will simply ensure
that the relevant writes are all completed.

It would be interesting to see benchmarks of this.
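
A minimal sketch of such a benchmark follows. It assumes a small append followed by fsync() is a fair stand-in for a database commit, and the function name and default sizes are made up for illustration. It would need to be run once with WCE enabled and once with it disabled, on a file system backed by the disk under test.

```python
import os
import tempfile
import time


def fsync_rate(directory: str, count: int = 200, size: int = 512) -> float:
    """Append `count` small records, fsync() after each, return calls/sec."""
    payload = b"x" * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to push this record to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    rate = fsync_rate(".")
    print(f"{rate:.0f} fsync() calls per second")
```

A rate well above the drive's rotations per second would suggest that writes are being acknowledged from a cache somewhere between the kernel and the platter. Memory-backed file systems will of course skew the numbers.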

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

#8Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Andrew - Supernews (#5)
Re: Fwd: Is the fsync() fake on FreeBSD6.1?

Andrew - Supernews wrote:

Whether the underlying device lies about the write completion is another
matter. All current SCSI disks have WCE enabled by default, which means
that they will lie about write completion if FUA was not set in the
request, which FreeBSD never sets. (It's not possible to get correct
results by having fsync() somehow selectively set FUA, because that would
leave previously-completed requests in the cache.)

WCE can be disabled on either a temporary or permanent basis by changing
the appropriate modepage. It's possible that Linux does this automatically,
or sets FUA on all writes, though that would surprise me considerably;
however I disclaim any knowledge of Linux internals.

The Linux SATA driver author Jeff Garzik suggests [note 1] that
"The ability of a filesystem or fsync(2) to cause a [FLUSH|SYNC] CACHE
command to be generated has only been present in the most recent [as of
mid 2005] 2.6.x kernels. See the "write barrier" stuff that people
have been discussing.

"Furthermore, read-after-write implies nothing at all. The only way
you can be assured that your data has "hit the platter" is
(1) issuing [FLUSH|SYNC] CACHE, or
(2) using FUA-style disk commands

"It sounds like your test (or reasoning) is invalid."

Before those mid-2005 2.6.x kernels, fsync on Linux apparently didn't
really try to flush caches even when drives supported it (which
apparently most do, if the requests are actually sent).

[note 1] http://lkml.org/lkml/2005/5/15/82