Fwd: Apple Darwin disabled fsync?
Date: Sat, 19 Feb 2005 17:59:21 -0800
From: Dominic Giampaolo <dbg@apple.com>
Subject: Re: bad fsync? (A.M.)
To: darwin-dev@lists.apple.com

MySQL makes the following claim at:
http://dev.mysql.com/doc/mysql/en/news-4-1-9.html

"InnoDB: Use the fcntl() file flush method on Mac OS X versions 10.3
and up. Apple had disabled fsync() in Mac OS X for internal disk
drives, which caused corruption at power outages."

First of all, is this accurate? A pointer to some docs or a tech note
on this would be helpful.

The comments about fsync() are wrong...
On MacOS X, fsync() always has and always will flush all file data
from host memory to the drive on which the file resides. The behavior
of fsync() on MacOS X is the same as it is on every other version of
Unix since the dawn of time (well, since the introduction of fsync
anyway :-).

I believe that what the above comment refers to is the fact that
fsync() is not sufficient to guarantee that your data is on stable
storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
to ask the drive to flush all buffered data to stable storage.

Let me explain in more detail. With fsync() even though the OS
writes the data through to the disk and the disk says "yes I wrote
the data", the data is not actually on permanent storage. Unless
you explicitly disable it, all disks have a write buffer which holds
data you've written. The disk buffers the data you wrote until it
decides to flush it to the platters (and the writes may not be in
the order you wrote them). If you lose power or the system crashes
before the data is written, you can wind up in a situation where only
some of your data is actually on disk. What is worse is that even if
you write blocks A, B and C, call fsync() and then write block D you
may find after rebooting that blocks A and D are on disk but B and C
are not (in fact any ordering of A, B, C, and D is possible).

While this may seem like a rare case it is not. In fact if you sit
down and pull the plug on a system you can make it happen in one or
two plug pulls. I have even gone so far as to watch this behavior
with a logic analyzer on the ATA bus: I saw the data for two writes
come across the ATA cable, the drive replied and said the writes were
successful and then when we rebooted the data from the second write
was correct on disk but the data from the first write was not.

To deal with this we introduced the F_FULLFSYNC fcntl which will ask
the drive to flush all of its buffered data to disk. When an app
needs to guarantee that data is on disk it should use F_FULLFSYNC.
In most cases you do not need such a heavy handed operation and
fsync() is good enough. But in an app like a database, it is
essential if you want transactional integrity.

Now, a little bit more detail: on ATA drives we implement F_FULLFSYNC
with the FLUSH_TRACK_CACHE command. All drives sold by Apple will
honor this command. Unfortunately quite a few firewire drive vendors
disable this command and do not pass it to the drive. This means that
most external firewire drives are not reliable if you lose power or
the system crashes. We can't work around that unless we ask the drive
to disable the write cache completely (which hurts performance quite
badly -- and even that may not be enough as some drives will ignore
that request too).

So in summary, I believe that the comments in the MySQL news posting
are slightly confused. On MacOS X fsync() behaves the same as it does
on all Unices. That's not good enough if you really care about data
integrity and so we also provide the F_FULLFSYNC fcntl. As far as I
know, MacOS X is the only OS to provide this feature for apps that
need to truly guarantee their data is on disk.

Hope this clears things up.
--dominic
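
[Editorial illustration, not part of the original message.] The distinction
drawn above between fsync() and F_FULLFSYNC can be sketched in C roughly as
follows. This is a minimal sketch under stated assumptions: the helper names
flush_to_stable_storage and write_durably are hypothetical, and falling back
to plain fsync() when F_FULLFSYNC is missing or refused is an assumption of
this example, not something the message prescribes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: push a file's data as far toward stable storage
 * as the platform allows. Returns 0 on success, -1 on failure.
 */
int flush_to_stable_storage(int fd)
{
#ifdef F_FULLFSYNC
    /* Darwin: flush the file and ask the drive to empty its write cache. */
    if (fcntl(fd, F_FULLFSYNC, 0) != -1)
        return 0;
    /* Assumption: if the request is refused, fall through to fsync(). */
#endif
    /* Elsewhere: fsync() only hands the data to the drive. */
    return fsync(fd);
}

/*
 * Hypothetical helper: write the whole buffer, then flush it.
 * EINTR handling is omitted to keep the sketch short.
 */
int write_durably(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;          /* error or no progress */
        p += n;
        len -= (size_t) n;
    }
    return flush_to_stable_storage(fd);
}
```

A database-style caller would presumably use something like write_durably()
for its log records and keep plain fsync() for data where the weaker
guarantee is acceptable.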
Peter Bierman <bierman@apple.com> writes:
I believe that what the above comment refers to is the fact that
fsync() is not sufficient to guarantee that your data is on stable
storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
to ask the drive to flush all buffered data to stable storage.
I've been looking for documentation on this without a lot of luck
("man fcntl" on OS X 10.3.8 has certainly never heard of it).
It's not completely clear whether this subsumes fsync() or whether
you're supposed to fsync() and then use the fcntl.
Also, isn't it fundamentally at the wrong level? One would suppose that
the drive flush operation is going to affect everything the drive
currently has queued, not just the one file. That makes it difficult
if not impossible to use efficiently.
regards, tom lane
Peter Bierman <bierman@apple.com> writes:
In most cases you do not need such a heavy handed operation and fsync() is
good enough.
Really? Can you think of a single application for which this definition of
fsync is useful?
Kernel buffers are transparent to the application, just as the disk buffer is.
It doesn't matter to an application whether the data is sitting in a kernel
buffer, or a buffer in the disk, it's equivalent. If fsync doesn't guarantee
the writes actually end up on non-volatile disk then as far as the application
is concerned it's just an expensive noop.
--
greg
At 12:38 AM -0500 2/20/05, Tom Lane wrote:
Dominic Giampaolo <dbg@apple.com> writes:
I believe that what the above comment refers to is the fact that
fsync() is not sufficient to guarantee that your data is on stable
storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
to ask the drive to flush all buffered data to stable storage.

I've been looking for documentation on this without a lot of luck
("man fcntl" on OS X 10.3.8 has certainly never heard of it).
It's not completely clear whether this subsumes fsync() or whether
you're supposed to fsync() and then use the fcntl.
My understanding is that you're supposed to fsync() and then use the
fcntl, but I'm not the filesystems expert. (Dominic, who wrote the
original message that I forwarded, is.)
I've filed a bug report asking for better documentation about this to
be placed in the fsync man page. <radar://4012378>
Also, isn't it fundamentally at the wrong level? One would suppose that
the drive flush operation is going to affect everything the drive
currently has queued, not just the one file. That makes it difficult
if not impossible to use efficiently.
I think the intent is to make the fcntl more accurate in time, as the
ability to do so appears in hardware.
One of the advantages Apple has is the ability to set very specific
requirements for our hardware. So if a block specific flush command
becomes part of the ATA spec, Apple can require vendors to support
it, and support it correctly, before using those drives.
On the other hand, as Dominic described, once the hardware is
external (like a firewire enclosure), we lose that leverage.
At 12:42 PM -0500 2/20/05, Greg Stark wrote:
Dominic Giampaolo <dbg@apple.com> writes:
In most cases you do not need such a heavy handed operation and fsync() is
good enough.

Really? Can you think of a single application for which this definition of
fsync is useful?

Kernel buffers are transparent to the application, just as the disk buffer is.
It doesn't matter to an application whether the data is sitting in a kernel
buffer, or a buffer in the disk, it's equivalent. If fsync doesn't guarantee
the writes actually end up on non-volatile disk then as far as the application
is concerned it's just an expensive noop.
I think the intent of fsync() is closer to what you describe, but the
convention is that fsync() hands responsibility to the disk hardware.
That's how every other Unix seems to handle fsync() too. This gives
you good performance, and if you combine a smart fsync()ing
application with reliable storage hardware (like an XServe RAID that
battery backs its own write caches), you get the best combination.
If you know you have unreliable hardware, and critical reliability
issues, then you can use the fcntl, which seems to be more control
than other OSes give.
-pmb
Peter Bierman <bierman@apple.com> writes:
I think the intent of fsync() is closer to what you describe, but the
convention is that fsync() hands responsibility to the disk hardware.
The "convention" was also that the hardware didn't confirm the command until
it had actually been executed...
None of this matters to the application. A specification for fsync(2) that
says it forces the data to be shuffled around under the hood but fundamentally
doesn't change the semantics (that the data isn't guaranteed to be in
non-volatile storage) means that fsync didn't really do anything.
--
greg
On Sun, Feb 20, 2005 at 10:50:35PM -0500, Greg Stark wrote:
Peter Bierman <bierman@apple.com> writes:
I think the intent of fsync() is closer to what you describe, but the
convention is that fsync() hands responsibility to the disk hardware.

The "convention" was also that the hardware didn't confirm the command until
it had actually been executed...

None of this matters to the application. A specification for fsync(2) that
says it forces the data to be shuffled around under the hood but fundamentally
doesn't change the semantics (that the data isn't guaranteed to be in
non-volatile storage) means that fsync didn't really do anything.
The real issue is this isn't specific to OS X. I know FreeBSD enables
write-caching on IDE drives by default, and I suspect Linux does as
well. It's probably worth adding a big, fat WARNING in the docs in
strategic places about this.
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
I think we should add a new wal_sync_method that will use Darwin's
F_FULLFSYNC fcntl().
From <sys/fcntl.h>:
#define F_FULLFSYNC 51 /* fsync + ask the drive to
flush to the media */
This fcntl() will basically perform an fsync() on the file, then flush
the write cache of the disk.
I'll attempt to work up the patch. It should be trivial. Might need
some help on the configure tests though (it should #include
<sys/fcntl.h> and make sure F_FULLFSYNC is defined).
What's an appropriate name? It seems equivalent to
"fsync_writethrough". I suggest "fsync_full", "fsync_flushdisk", or
something. Is there a reason we're not indicating the supported
platform in the name of the method? Would "fsync_darwinfull" be better?
Let users know that it's only available for Darwin? Should we do the
same thing with win32-specific methods?
I think fsync() and F_FULLFSYNC should both be available as
options on Darwin. Currently in the code, "fsync" and
"fsync_writethrough" set sync_method to SYNC_METHOD_FSYNC, so there's
no way to distinguish between them.
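
[Editorial illustration, not part of the original message.] One way the two
methods could be kept distinct is sketched below in C. This is not the actual
patch. SYNC_METHOD_FSYNC is the identifier named above; SYNC_METHOD_FSYNC_FULL,
the setting name "fsync_full" (one of the names floated in this message), and
the functions parse_sync_method and issue_sync are hypothetical. The sketch
includes <fcntl.h> rather than <sys/fcntl.h> for portability, and the #ifdef
stands in for the configure test.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

typedef enum {
    SYNC_METHOD_FSYNC,          /* plain fsync() */
    SYNC_METHOD_FSYNC_FULL      /* hypothetical: fcntl(F_FULLFSYNC) */
} SyncMethod;

/* Map a setting name to a method. Returns 0 on success, -1 if unsupported. */
int parse_sync_method(const char *name, SyncMethod *out)
{
    assert(name != NULL && out != NULL);
    if (strcmp(name, "fsync") == 0) {
        *out = SYNC_METHOD_FSYNC;
        return 0;
    }
#ifdef F_FULLFSYNC
    /* Only accept the full-flush method where the header defines it. */
    if (strcmp(name, "fsync_full") == 0) {
        *out = SYNC_METHOD_FSYNC_FULL;
        return 0;
    }
#endif
    return -1;
}

/* Perform the selected kind of sync. Returns 0 on success, -1 on failure. */
int issue_sync(int fd, SyncMethod method)
{
    switch (method) {
    case SYNC_METHOD_FSYNC:
        return fsync(fd);
    case SYNC_METHOD_FSYNC_FULL:
#ifdef F_FULLFSYNC
        return fcntl(fd, F_FULLFSYNC, 0) == -1 ? -1 : 0;
#else
        errno = ENOSYS;         /* method not available on this platform */
        return -1;
#endif
    }
    errno = EINVAL;
    return -1;
}
```

With separate enum values, the default could stay on fsync() while users who
want the drive cache flushed select the full method explicitly.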
Unsure which one would be the best default. fsync() matches the
semantics on other platforms. And conscientious users could specify the
F_FULLFSYNC fcntl() method if they want to make sure it goes through
the write cache.
Comments?
Thanks!
- Chris