[PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Started by Marti Raudseppabout 15 years ago30 messages

marti@juffo.org

about 15 years ago

1 attachment(s)

Hi list,

PostgreSQL's default settings change when built with Linux kernel
headers 2.6.33 or newer. As discussed on the pgsql-performance list,
this causes a significant performance regression:
http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php

NB! I am not proposing to change the default -- to the contrary --
this patch restores old behavior. Users might be in for a nasty
performance surprise when re-building their Postgres with newer Linux
headers (as was I), so I propose that this change should be made in
all supported releases.

-- commit message --
Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Linux kernel headers from 2.6.33 (and later) change the behavior of the
O_SYNC flag. Previously O_SYNC was aliased to O_DSYNC, which caused
PostgreSQL to use fdatasync as the default instead.

Starting with kernels 2.6.33 and later, the definitions of O_DSYNC and
O_SYNC differ. When built with headers from these newer kernels,
PostgreSQL will default to using open_datasync. This patch reverts the
Linux default to fdatasync, which has had much more testing over time
and also significantly better performance.
-- end commit message --

Earlier kernel headers defined O_SYNC and O_DSYNC to 0x1000
2.6.33 and later define O_SYNC=0x101000 and O_DSYNC=0x1000 (since old
behavior on most FS-es was always equivalent to POSIX O_DSYNC)

More details at:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b2f3d1f769be5779b479c37800229d9a4809fc3

Currently PostgreSQL's include/access/xlogdefs.h defaults to using
open_datasync when O_SYNC != O_DSYNC, otherwise fdatasync is used.

Since other platforms might want to default to fdatasync in the
future, too, I defined a new PLATFORM_DEFAULT_SYNC_METHOD constant in
include/port/linux.h. I don't know if this is the best way to do it.

Regards,
Marti

Attachments:

0001-Revert-default-wal_sync_method-to-fdatasync-on-Linux.patchtext/x-patch; charset=US-ASCII; name=0001-Revert-default-wal_sync_method-to-fdatasync-on-Linux.patchDownload

From fbf61c6536b7060e5c6745c8221a5a4fb9a53c92 Mon Sep 17 00:00:00 2001
From: Marti Raudsepp <marti@juffo.org>
Date: Fri, 5 Nov 2010 17:40:22 +0200
Subject: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Linux kernel headers from 2.6.33 (and later) change the behavior of the
O_SYNC flag. Previously O_SYNC was aliased to O_DSYNC, which caused
PostgreSQL to use fdatasync as the default instead.

Starting with kernels 2.6.33 and later, the definitions of O_DSYNC and
O_SYNC differ. When built with headers from these newer kernels,
PostgreSQL will default to using open_datasync. This patch reverts the
Linux default to fdatasync, which has had much more testing over time
and also significantly better performance.
---
 src/include/access/xlogdefs.h |    5 +++++
 src/include/port/linux.h      |    9 +++++++++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 18b214e..7ed1bde 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -13,6 +13,7 @@
 #define XLOG_DEFS_H
 
 #include <fcntl.h>				/* need open() flags */
+#include "pg_config_os.h"
 
 /*
  * Pointer to a location in the XLOG.  These pointers are 64 bits wide,
@@ -123,6 +124,9 @@ typedef uint32 TimeLineID;
 #endif
 #endif
 
+#if defined(PLATFORM_DEFAULT_SYNC_METHOD)
+#define DEFAULT_SYNC_METHOD PLATFORM_DEFAULT_SYNC_METHOD
+#else							/* !defined(PLATFORM_DEFAULT_SYNC_METHOD) */
 #if defined(OPEN_DATASYNC_FLAG)
 #define DEFAULT_SYNC_METHOD		SYNC_METHOD_OPEN_DSYNC
 #elif defined(HAVE_FDATASYNC)
@@ -132,6 +136,7 @@ typedef uint32 TimeLineID;
 #else
 #define DEFAULT_SYNC_METHOD		SYNC_METHOD_FSYNC
 #endif
+#endif
 
 /*
  * Limitation of buffer-alignment for direct IO depends on OS and filesystem,
diff --git a/src/include/port/linux.h b/src/include/port/linux.h
index b9498b2..5a88590 100644
--- a/src/include/port/linux.h
+++ b/src/include/port/linux.h
@@ -12,3 +12,12 @@
  * to have a kernel version test here.
  */
 #define HAVE_LINUX_EIDRM_BUG
+
+/*
+ * The definition of O_SYNC changed in Linux kernel headers starting with
+ * 2.6.33.  This caused PostgreSQL to change the wal_sync_method default from
+ * fdatasync to open_datasync.  Since fdatasync is more tested on Linux and has
+ * better performance, default to fdatasync on Linux.
+ */
+#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
+
-- 
1.7.3.2

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Marti Raudsepp (#1)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Marti Raudsepp <marti@juffo.org> writes:

PostgreSQL's default settings change when built with Linux kernel
headers 2.6.33 or newer. As discussed on the pgsql-performance list,
this causes a significant performance regression:
http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php

NB! I am not proposing to change the default -- to the contrary --
this patch restores old behavior.

I'm less than convinced this is the right approach ...

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?

regards, tom lane

Marti Raudsepp

marti@juffo.org

about 15 years ago

In reply to: Tom Lane (#2)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm less than convinced this is the right approach ...

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?

Sure, maybe for PostgreSQL 9.1

But the immediate problem is older releases (8.1 - 9.0) specifically
on Linux. Something as innocuous as re-building your DB on a newer
kernel will radically affect performance -- even when the DB kernel
didn't change.

So I think we should aim to fix old versions first. Do you disagree?

Regards,
Marti

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Marti Raudsepp (#3)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Marti Raudsepp <marti@juffo.org> writes:

On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?

So I think we should aim to fix old versions first. Do you disagree?

What's that got to do with it?

regards, tom lane

Marti Raudsepp

marti@juffo.org

about 15 years ago

In reply to: Tom Lane (#4)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

On Fri, Nov 5, 2010 at 21:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marti Raudsepp <marti@juffo.org> writes:

On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?

So I think we should aim to fix old versions first. Do you disagree?

What's that got to do with it?

I'm not sure what you're asking.

Surely changing the default wal_sync_method for all OSes in
maintenance releases is out of the question, no?

Regards,
Marti

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Marti Raudsepp (#5)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Marti Raudsepp <marti@juffo.org> writes:

On Fri, Nov 5, 2010 at 21:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

What's that got to do with it?

I'm not sure what you're asking.

Surely changing the default wal_sync_method for all OSes in
maintenance releases is out of the question, no?

Well, if we could leave well enough alone it would be fine with me,
but I think our hand is being forced by the Linux kernel hackers.
I don't really think that "change the default on Linux" is that
much nicer than "change the default everywhere" when it comes to
what we ought to consider back-patching. In any case, you're getting
ahead of the game: we need to decide on the desired behavior first and
then think about what to patch. Do the performance results that were
cited show that open_dsync is generally inferior to fdatasync? If so,
why would we think that that conclusion is Linux-specific?

regards, tom lane

Andres Freund

andres@anarazel.de

about 15 years ago

In reply to: Tom Lane (#2)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

On Friday 05 November 2010 19:13:47 Tom Lane wrote:

Marti Raudsepp <marti@juffo.org> writes:

PostgreSQL's default settings change when built with Linux kernel
headers 2.6.33 or newer. As discussed on the pgsql-performance list,
this causes a significant performance regression:
http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php

NB! I am not proposing to change the default -- to the contrary --
this patch restores old behavior.

I'm less than convinced this is the right approach ...

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?

I fail to see how it could be beneficial on *any* non-buggy platform.
Especially with small wal_buffers and larger commits (but also otherwise) it
increases the amount of synchronous writes the os has to do tremendously.

* It removes about all benefits of XLogBackgroundFlush()
* It removes any chances of reordering after writing.
* It makes AdvanceXLInsertBuffer synchronous if it has to write outy

Whats the theory about placing it so high in the preferences list?

Andres

Marti Raudsepp

marti@juffo.org

about 15 years ago

In reply to: Tom Lane (#6)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

On Fri, Nov 5, 2010 at 22:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I don't really think that "change the default on Linux" is that
much nicer than "change the default everywhere" when it comes to
what we ought to consider back-patching. In any case, you're getting
ahead of the game: we need to decide on the desired behavior first and
then think about what to patch.

We should be trying to guarantee the stability of maintenance
releases. "Stability" includes consistent defaults. The fact that
Linux now distinguishes between these two flags has a very surprising
effect on PostgreSQL's defaults; an effect that wasn't intended by any
developer, is not documented anywhere, and certainly won't be
anticipated by users.

Do you reject this premise?

As newer distros are adopting 2.6.33+ kernels, more and more people
will shoot themselves in the foot by this change. I am also worried
that it will have a direct effect on PostgreSQL adoption.

Regards,
Marti

Greg Smith

greg@2ndquadrant.com

about 15 years ago

In reply to: Tom Lane (#2)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Tom Lane wrote:

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?

And I've seen the expected sync write performance gain over fdatasync on
a system with a battery-backed cache running VxFS on Linux, because
working open_[d]sync means O_DIRECT writes bypassing the OS cache, and
therefore reducing cache pollution from WAL writes. This doesn't work
by default on Solaris because they have a special system call you have
to execute for direct output, but if you trick the OS into doing that
via mount options you can observe it there too. The last serious tests
of this area I saw on that platform were from Jignesh, and they
certainly didn't show a significant performance regression running in
sync mode. I vaguely recall seeing a set once that showed a minor loss
compared to fdatasync, but it was too close to make any definitive
statement about reordering.

I haven't seen any report yet of a serious performance regression in the
new Linux case that was written by someone who understands fully how
fsync and drive cache flushing are supposed to interact. It's been
obvious for a year now that the reports from Phoronix about this had no
idea what they were actually testing. I didn't see anything from
Marti's report that definitively answers whether this is anything other
than Linux finally doing the right thing to flush drive caches out when
sync writes happen. There may be a performance regression here related
to WAL data going out in smaller chunks than it used to, but in all the
reports I've seen it that hasn't been isolated well enough to consider
making any changes yet--to tell if it's a performance loss or a
reliability gain we're seeing.

I'd like to see some output from the 9.0 test_fsync on one of these
RHEL6 systems on a system without a battery backed write cache as a
first step here. That should start to shed some light on what's
happening. I just bumped up the priority on the pending upgrade of my
spare laptop to the RHEL6 beta I had been trying to find time for, so I
can investigate this further myself.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD

#10

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Andres Freund (#7)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Andres Freund <andres@anarazel.de> writes:

On Friday 05 November 2010 19:13:47 Tom Lane wrote:

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?

I fail to see how it could be beneficial on *any* non-buggy platform.
Especially with small wal_buffers and larger commits (but also otherwise) it
increases the amount of synchronous writes the os has to do tremendously.

* It removes about all benefits of XLogBackgroundFlush()
* It removes any chances of reordering after writing.
* It makes AdvanceXLInsertBuffer synchronous if it has to write outy

Whats the theory about placing it so high in the preferences list?

I think the original idea was that if you had a dedicated WAL drive then
sync-on-write would be reasonable. But that was a very long time ago
and I'm not sure that the system's behavior is anything like what it was
then; for that matter I'm not sure we had proof that it was an optimal
choice even back then. That's why I want to revisit the choice of
default and not just go for "minimum" change.

regards, tom lane

#11

Andres Freund

andres@anarazel.de

about 15 years ago

In reply to: Greg Smith (#9)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

On Friday 05 November 2010 22:53:37 Greg Smith wrote:

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?

And I've seen the expected sync write performance gain over fdatasync on
a system with a battery-backed cache running VxFS on Linux, because
working open_[d]sync means O_DIRECT writes bypassing the OS cache, and
therefore reducing cache pollution from WAL writes.

Which looks like a setup where you definitely need to know what you do. I.e.
don't need support from wal_sync_method by default being open_fdatasync...

Andres

#12

Josh Berkus

josh@agliodbs.com

about 15 years ago

In reply to: Tom Lane (#10)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

I think the original idea was that if you had a dedicated WAL drive then
sync-on-write would be reasonable. But that was a very long time ago
and I'm not sure that the system's behavior is anything like what it was
then; for that matter I'm not sure we had proof that it was an optimal
choice even back then. That's why I want to revisit the choice of
default and not just go for "minimum" change.

What plaforms do we need to test to get a reasonable idea? Solaris,
FreeBSD, Windows?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#13

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Josh Berkus (#12)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Josh Berkus <josh@agliodbs.com> writes:

What plaforms do we need to test to get a reasonable idea? Solaris,
FreeBSD, Windows?

At least. I'm hoping that Greg Smith will take the lead on testing
this, since he seems to have spent the most time in the area so far.

regards, tom lane

#14

Josh Berkus

josh@agliodbs.com

about 15 years ago

In reply to: Tom Lane (#13)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

On 11/5/10 3:31 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

What plaforms do we need to test to get a reasonable idea? Solaris,
FreeBSD, Windows?

At least. I'm hoping that Greg Smith will take the lead on testing
this, since he seems to have spent the most time in the area so far.

I could test at least 1 version of Solaris, I think.

Greg, any recommendations on pgbench parameters?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#15

Greg Smith

greg@2ndquadrant.com

about 15 years ago

In reply to: Tom Lane (#13)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Tom Lane wrote:

I'm hoping that Greg Smith will take the lead on testing
this, since he seems to have spent the most time in the area so far.

It's not coincidence that the chapter of my book I convinced the
publisher to release as a sample is the one that covers this area; this
mess has been visibly approaching for some time now. I'm going to put
RHEL6 onto a system and start collecting some proper slowdown numbers
this week, then pass along a suggested test regime for others.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD

#16

Greg Smith

greg@2ndquadrant.com

about 15 years ago

In reply to: Marti Raudsepp (#1)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Marti Raudsepp wrote:

PostgreSQL's default settings change when built with Linux kernel
headers 2.6.33 or newer. As discussed on the pgsql-performance list,
this causes a significant performance regression:
http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php

NB! I am not proposing to change the default -- to the contrary --
this patch restores old behavior.

Following our standard community development model, I've put this patch
onto our CommitFest list:
https://commitfest.postgresql.org/action/patch_view?id=432 and assigned
myself as the reviewer. I didn't look at this until now because I
already had some patch development and review work to finish before the
CommitFest deadline we just crossed. Now I can go back to reviewing
other people's work.

P.S. There is no pgsql-patch list anymore; everything goes through the
hackers list now.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

#17

Josh Berkus

josh@agliodbs.com

about 15 years ago

In reply to: Andres Freund (#7)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

All,

So, this week I've had my hands on a medium-high-end test system where I
could test various wal_sync_methods. This is a 24-core Intel Xeon
machine with 72GB of ram, and 8 internal 10K SAS disks attached to a
raid controller with 512MB BBU write cache. 2 of the disks are in a
RAID1, which supports both an Ext4 partition and an XFS partition. The
remaining disks are in a RAID10 which only supports a single pgdata
partition.

This is running on RHEL6, Linux Kernel: 2.6.32-71.el6.x86_64

I think this kind of a system much better represents our users who are
performance-conscious than testing on people's laptops or on VMs does.

I modified test_fsync in two ways to run this; first, to make it support
O_DIRECT, and second to make it run in the *current* directory. I think
the second change should be permanent; I imagine that a lot of people
who are running test_fsync are not aware that they're actually testing
the performance of /var/tmp, not whatever FS mount they wanted to test.

Here's the results. I think you'll agree that, at least on Linux, the
benefits of o_sync and o_dsync as defaults would be highly questionable.
Particularly, it seems that if O_DIRECT support is absent, fdatasync is
across-the-board faster:

=============

test_fsync with directIO, on 2 drives, XFS tuned:

Loops = 10000

Simple write:
8k write 198629.457/second

Compare file sync methods using one write:
open_datasync 8k write 14798.263/second
open_sync 8k write 14316.864/second
8k write, fdatasync 12198.871/second
8k write, fsync 12371.843/second

Compare file sync methods using two writes:
2 open_datasync 8k writes 7362.805/second
2 open_sync 8k writes 7156.685/second
8k write, 8k write, fdatasync 10613.525/second
8k write, 8k write, fsync 10597.396/second

Compare open_sync with different sizes:
open_sync 16k write 13631.816/second
2 open_sync 8k writes 7645.038/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 11427.096/second
8k write, close, fsync 11321.220/second

test_fsync with directIO, on 6 drives RAID10, XFS tuned:

Loops = 10000

Simple write:
8k write 196494.537/second

Compare file sync methods using one write:
open_datasync 8k write 14909.974/second
open_sync 8k write 14559.326/second
8k write, fdatasync 11046.025/second
8k write, fsync 11046.916/second

Compare file sync methods using two writes:
2 open_datasync 8k writes 7349.223/second
2 open_sync 8k writes 7667.395/second
8k write, 8k write, fdatasync 9560.495/second
8k write, 8k write, fsync 9557.287/second

Compare open_sync with different sizes:
open_sync 16k write 12060.049/second
2 open_sync 8k writes 7650.746/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 9377.107/second
8k write, close, fsync 9251.233/second

test_fsync without directIO on RAID1, Ext4, data=journal:

Loops = 10000

Simple write:
8k write 150514.005/second

Compare file sync methods using one write:
open_datasync 8k write 4012.070/second
open_sync 8k write 5476.898/second
8k write, fdatasync 5512.649/second
8k write, fsync 5803.814/second

Compare file sync methods using two writes:
2 open_datasync 8k writes 2910.401/second
2 open_sync 8k writes 2817.377/second
8k write, 8k write, fdatasync 5041.608/second
8k write, 8k write, fsync 5155.248/second

Compare open_sync with different sizes:
open_sync 16k write 4895.956/second
2 open_sync 8k writes 2720.875/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 4724.052/second
8k write, close, fsync 4694.776/second

test_fsync without directIO on RAID1, XFS, tuned:

Loops = 10000

Simple write:
8k write 199796.208/second

Compare file sync methods using one write:
open_datasync 8k write 12553.525/second
open_sync 8k write 12535.978/second
8k write, fdatasync 12268.298/second
8k write, fsync 12305.875/second

Compare file sync methods using two writes:
2 open_datasync 8k writes 6323.835/second
2 open_sync 8k writes 6285.169/second
8k write, 8k write, fdatasync 10893.756/second
8k write, 8k write, fsync 10752.607/second

Compare open_sync with different sizes:
open_sync 16k write 11053.510/second
2 open_sync 8k writes 6293.270/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 11087.482/second
8k write, close, fsync 11157.477/second

test_fsync without directIO on RAID10, 6 drives, XFS Tuned:

Loops = 10000

Simple write:
8k write 197262.003/second

Compare file sync methods using one write:
open_datasync 8k write 12784.699/second
open_sync 8k write 12684.512/second
8k write, fdatasync 12404.547/second
8k write, fsync 12452.757/second

Compare file sync methods using two writes:
2 open_datasync 8k writes 6376.587/second
2 open_sync 8k writes 6364.113/second
8k write, 8k write, fdatasync 9895.699/second
8k write, 8k write, fsync 9866.886/second

Compare open_sync with different sizes:
open_sync 16k write 10156.491/second
2 open_sync 8k writes 6400.889/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 11142.620/second
8k write, close, fsync 11076.393/second

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#18

Josh Berkus

josh@agliodbs.com

about 15 years ago

In reply to: Josh Berkus (#17)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

All,

While I have this machine available I've been trying to run some
performance tests using pgbench and various wal_sync_methods. However,
I seem to be maxing out at the speed of pgbench itself; no matter which
wal_sync_method I use (including "fsync"), it tops out at around 2750 TPS.

Of course, it's also possible that the wal_sync_method does not in fact
make a difference in throughput.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#19

Greg Smith

greg@2ndquadrant.com

about 15 years ago

In reply to: Josh Berkus (#17)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Josh Berkus wrote:

I modified test_fsync in two ways to run this; first, to make it support
O_DIRECT, and second to make it run in the *current* directory.

Patch please? I agree with the latter change; what test_fsync does is
surprising.

I suggested a while ago that we refactor test_fsync to use a common set
of source code as the database itself for detecting things related to
wal_sync_method, perhaps just extract that whole set of DEFINE macro
logic to somewhere else. That happened at a bad time in the development
cycle (right before a freeze) and nobody ever got back to the idea
afterwards. If this code is getting touched, and it's clear it is in
some direction, I'd like to see things change so it's not possible for
the two to diverge again afterwards.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us

#20

Josh Berkus

josh@agliodbs.com

about 15 years ago

In reply to: Greg Smith (#19)

1 attachment(s)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

On 12/5/10 2:12 PM, Greg Smith wrote:

Josh Berkus wrote:

I modified test_fsync in two ways to run this; first, to make it support
O_DIRECT, and second to make it run in the *current* directory.

Patch please? I agree with the latter change; what test_fsync does is
surprising.

Attached.

Making it support O_DIRECT would be possible but more complex; I don't
see the point unless we think we're going to have open_sync_with_odirect
as a seperate option.

I suggested a while ago that we refactor test_fsync to use a common set
of source code as the database itself for detecting things related to
wal_sync_method, perhaps just extract that whole set of DEFINE macro
logic to somewhere else. That happened at a bad time in the development
cycle (right before a freeze) and nobody ever got back to the idea
afterwards. If this code is getting touched, and it's clear it is in
some direction, I'd like to see things change so it's not possible for
the two to diverge again afterwards.

I don't quite follow you. Maybe nobody else did last time, either.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

Attachments:

test_fsync.patchtext/x-patch; name=test_fsync.patchDownload

diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index 28c2119..12a83e1 100644
*** a/src/tools/fsync/test_fsync.c
--- b/src/tools/fsync/test_fsync.c
***************
*** 23,34 ****
  #include <string.h>
  
  
! #ifdef WIN32
  #define FSYNC_FILENAME	"./test_fsync.out"
- #else
- /* /tmp might be a memory file system */
- #define FSYNC_FILENAME	"/var/tmp/test_fsync.out"
- #endif
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */
  
--- 23,32 ----
  #include <string.h>
  
  
! /* put the temp files in the local directory
!    this is a change from older versions which used
!    /var/tmp */
  #define FSYNC_FILENAME	"./test_fsync.out"
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */

#21

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Josh Berkus (#20)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Josh Berkus <josh@agliodbs.com> writes:

Making it support O_DIRECT would be possible but more complex; I don't
see the point unless we think we're going to have open_sync_with_odirect
as a seperate option.

Whether it's complex or not isn't really the issue. The issue is that
what test_fsync is testing had better match what the backend does, or
people will be making choices based on not-comparable test results.
I think we should have test_fsync just automatically fold in O_DIRECT
the same way the backend does.

regards, tom lane

#22

Andrew Dunstan

andrew@dunslane.net

about 15 years ago

In reply to: Tom Lane (#21)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

On 12/06/2010 08:38 PM, Tom Lane wrote:

Josh Berkus<josh@agliodbs.com> writes:

Making it support O_DIRECT would be possible but more complex; I don't
see the point unless we think we're going to have open_sync_with_odirect
as a seperate option.

Whether it's complex or not isn't really the issue. The issue is that
what test_fsync is testing had better match what the backend does, or
people will be making choices based on not-comparable test results.
I think we should have test_fsync just automatically fold in O_DIRECT
the same way the backend does.

Indeed. We were quite confused for a while when we were dealing with
this about a week ago, and my handwritten test program failed as
expected but test_fsync didn't. Anything other than behaving just as the
backend does violates POLA, in my view.

cheers

andrew

#23

Josh Berkus

josh@agliodbs.com

about 15 years ago

In reply to: Tom Lane (#21)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Whether it's complex or not isn't really the issue. The issue is that
what test_fsync is testing had better match what the backend does, or
people will be making choices based on not-comparable test results.
I think we should have test_fsync just automatically fold in O_DIRECT
the same way the backend does.

OK, patch coming then. Right now test_fsync aborts when O_DIRECT fails.
What should I have it do instead?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#24

Tom Lane

tgl@sss.pgh.pa.us

about 15 years ago

In reply to: Josh Berkus (#23)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Josh Berkus <josh@agliodbs.com> writes:

OK, patch coming then. Right now test_fsync aborts when O_DIRECT fails.
What should I have it do instead?

Report that it fails, and keep testing the other methods.

regards, tom lane

#25

Josh Berkus

josh@agliodbs.com

about 15 years ago

In reply to: Tom Lane (#24)

1 attachment(s)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

On 12/6/10 6:13 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

OK, patch coming then. Right now test_fsync aborts when O_DIRECT fails.
What should I have it do instead?

Report that it fails, and keep testing the other methods.

Patch attached. Includes a fair amount of comment cleanup, since
existing comments did not meet our current project standards. Tests all
6 of the methods we support separately.

Some questions, though:

(1) Why are we doing the open_sync different-size write test? AFAIK,
this doesn't match any behavior which PostgreSQL has.

(2) In this patch, I'm stepping down the number of loops which
fsync_writethrough does by 90%. The reason for that was that on the
platforms where I tested writethrough (desktop machines), doing 10,000
loops took 15-20 *minutes*, which seems hard on the user. Would be easy
to revert if you think it's a bad idea.
Possibly auto-sizing the number of loops based on the first fsync test
might be a good idea, but seems like going a bit too far.

(3) Should the multi-descriptor test be using writethrough on platforms
which support it?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

Attachments:

test_fsync_expanded.patchtext/x-patch; name=test_fsync_expanded.patchDownload

diff --git a/src/tools/fsync/Makefile b/src/tools/fsync/Makefile
index 252c087..2ddbbe9 100644
*** a/src/tools/fsync/Makefile
--- b/src/tools/fsync/Makefile
***************
*** 4,10 ****
  #
  # Copyright (c) 2003-2010, PostgreSQL Global Development Group
  #
! # src/tools/fsync/Makefile
  #
  #-------------------------------------------------------------------------
  
--- 4,10 ----
  #
  # Copyright (c) 2003-2010, PostgreSQL Global Development Group
  #
! # $PostgreSQL: pgsql/src/tools/fsync/Makefile,v 1.9 2010/07/05 18:54:38 tgl Exp $
  #
  #-------------------------------------------------------------------------
  
*************** override CPPFLAGS := -I$(libpq_srcdir) $
*** 16,24 ****
  
  OBJS= test_fsync.o
  
! all: test_fsync
  
! test_fsync: test_fsync.o | submake-libpq submake-libpgport
  	$(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
  
  clean distclean maintainer-clean:
--- 16,24 ----
  
  OBJS= test_fsync.o
  
! all: submake-libpq submake-libpgport test_fsync
  
! test_fsync: test_fsync.o $(libpq_builddir)/libpq.a
  	$(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
  
  clean distclean maintainer-clean:
diff --git a/src/tools/fsync/README b/src/tools/fsync/README
index 6d9acd3..5b45581 100644
*** a/src/tools/fsync/README
--- b/src/tools/fsync/README
***************
*** 1,4 ****
! src/tools/fsync/README
  
  fsync
  =====
--- 1,4 ----
! $PostgreSQL: pgsql/src/tools/fsync/README,v 1.5 2009/11/28 15:04:54 momjian Exp $
  
  fsync
  =====
*************** fsync
*** 6,11 ****
  This program tests fsync.  The tests are described as part of the program output.
  
  	Usage:	test_fsync [-f filename] [loops]
  
! Loops defaults to 5000.  The default output file is /var/tmp/test_fsync.out.
! Consider that /tmp or /var/tmp might be memory-based file systems.
--- 6,25 ----
  This program tests fsync.  The tests are described as part of the program output.
  
  	Usage:	test_fsync [-f filename] [loops]
+ 	
+ test_fsync is intended to give you a reasonable idea of what the fastest 
+ fsync_method is on your specific system, as well as supplying diagnostic 
+ information in the event of an identified I/O problem.  However, differences
+ shown by test_fsync may not make any difference in real database throughput,
+ especially since many database servers are not speed-limited by their
+ transaction logs.
  
! Filename defaults to test_fsync.out in the current directory. test_fsync
! should be run on the same filesystem where your transaction log currently
! resides.
! 
! Loops default to 10000, except for writethrough tests, where there are 1/10 of
! that in order to make the user not wait forever.  You should lower loops if you
! have a slow system and the tests are taking more than 5 minutes each.  You should
! raise loops if your system is faster than 5000/second, in order to get useful
! statistics.
diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index 28c2119..5980b70 100644
*** a/src/tools/fsync/test_fsync.c
--- b/src/tools/fsync/test_fsync.c
***************
*** 3,9 ****
   *
   *
   *	test_fsync.c
!  *		test various fsync() methods
   */
  
  #include "postgres.h"
--- 3,9 ----
   *
   *
   *	test_fsync.c
!  *		tests all supported fsync() methods
   */
  
  #include "postgres.h"
***************
*** 22,55 ****
  #include <unistd.h>
  #include <string.h>
  
! 
! #ifdef WIN32
  #define FSYNC_FILENAME	"./test_fsync.out"
- #else
- /* /tmp might be a memory file system */
- #define FSYNC_FILENAME	"/var/tmp/test_fsync.out"
- #endif
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */
  
  #define LABEL_FORMAT	"\t%-30s"
  
  int			loops = 10000;
  
  void		die(char *str);
  void		print_elapse(struct timeval start_t, struct timeval stop_t);
  
  int
  main(int argc, char *argv[])
  {
  	struct timeval start_t;
  	struct timeval stop_t;
! 	int			tmpfile,
! 				i;
  	char	   *full_buf = (char *) malloc(XLOG_SEG_SIZE),
  			   *buf;
  	char	   *filename = FSYNC_FILENAME;
  
  	if (argc > 2 && strcmp(argv[1], "-f") == 0)
  	{
  		filename = argv[2];
--- 22,58 ----
  #include <unistd.h>
  #include <string.h>
  
! /* 
!  * put the temp files in the local directory
!  * unless the user specifies otherwise 
!  */
  #define FSYNC_FILENAME	"./test_fsync.out"
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */
  
  #define LABEL_FORMAT	"\t%-30s"
  
  int			loops = 10000;
+ int			writethrough_loops = 1000;
  
  void		die(char *str);
  void		print_elapse(struct timeval start_t, struct timeval stop_t);
+ void		print_elapse_writethrough(struct timeval start_t, struct timeval stop_t);
  
  int
  main(int argc, char *argv[])
  {
  	struct timeval start_t;
  	struct timeval stop_t;
! 	int			tmpfile;
! 	int         i;
  	char	   *full_buf = (char *) malloc(XLOG_SEG_SIZE),
  			   *buf;
  	char	   *filename = FSYNC_FILENAME;
  
+ 	/* 
+ 	 * arguments: loops and filename (optional) 
+ 	 */
  	if (argc > 2 && strcmp(argv[1], "-f") == 0)
  	{
  		filename = argv[2];
*************** main(int argc, char *argv[])
*** 57,73 ****
  		argc -= 2;
  	}
  
! 	if (argc > 1)
  		loops = atoi(argv[1]);
  
  	for (i = 0; i < XLOG_SEG_SIZE; i++)
  		full_buf[i] = random();
  
  	if ((tmpfile = open(filename, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) == -1)
  		die("Cannot open output file.");
  	if (write(tmpfile, full_buf, XLOG_SEG_SIZE) != XLOG_SEG_SIZE)
  		die("write failed");
! 	/* fsync now so later fsync's don't have to do it */
  	if (fsync(tmpfile) != 0)
  		die("fsync failed");
  	close(tmpfile);
--- 60,88 ----
  		argc -= 2;
  	}
  
! 	/* 
! 	 * set writethrough_loops to be 1/10 of loops
! 	 * since writethroughs are very slow 
! 	 */
! 	if (argc > 1) 
! 	{
  		loops = atoi(argv[1]);
+ 		writethrough_loops = loops / 10;
+ 	}
  
  	for (i = 0; i < XLOG_SEG_SIZE; i++)
  		full_buf[i] = random();
  
+ 	/* 
+ 	 * test if we can open the target file 
+ 	 */
  	if ((tmpfile = open(filename, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) == -1)
  		die("Cannot open output file.");
  	if (write(tmpfile, full_buf, XLOG_SEG_SIZE) != XLOG_SEG_SIZE)
  		die("write failed");
! 	/*
! 	 * fsync now so that dirty buffers don't skew later tests
! 	 */
  	if (fsync(tmpfile) != 0)
  		die("fsync failed");
  	close(tmpfile);
*************** main(int argc, char *argv[])
*** 77,83 ****
  	printf("Loops = %d\n\n", loops);
  
  	/*
! 	 * Simple write
  	 */
  	printf("Simple write:\n");
  	printf(LABEL_FORMAT, "8k write");
--- 92,98 ----
  	printf("Loops = %d\n\n", loops);
  
  	/*
! 	 * Test a simple write without fsync
  	 */
  	printf("Simple write:\n");
  	printf(LABEL_FORMAT, "8k write");
*************** main(int argc, char *argv[])
*** 95,104 ****
  	print_elapse(start_t, stop_t);
  
  	/*
! 	 * Compare file sync methods with one 8k write
  	 */
  	printf("\nCompare file sync methods using one write:\n");
  
  #ifdef OPEN_DATASYNC_FLAG
  	printf(LABEL_FORMAT, "open_datasync 8k write");
  	fflush(stdout);
--- 110,122 ----
  	print_elapse(start_t, stop_t);
  
  	/*
! 	 * Test all fsync methods using single 8k writes
  	 */
  	printf("\nCompare file sync methods using one write:\n");
  
+ 	/*
+ 	 * Test open_datasync if available
+ 	 */
  #ifdef OPEN_DATASYNC_FLAG
  	printf(LABEL_FORMAT, "open_datasync 8k write");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 115,124 ****
--- 133,174 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 
+ 	/*
+ 	 * If O_DIRECT is enabled, test that with open_datasync
+ 	 */
+ 	if ( PG_O_DIRECT != 0 )
+ 	{
+ 		printf(LABEL_FORMAT, "open_datasync 8k directIO write");
+ 		fflush(stdout);
+ 		if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
+ 			printf("\t(unavailable: o_direct on this filesystem)\n");
+ 		else
+ 		{
+ 			gettimeofday(&start_t, NULL);
+ 			for (i = 0; i < loops; i++)
+ 			{
+ 				if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 					die("write failed");
+ 				if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 					die("seek failed");
+ 			}
+ 			gettimeofday(&stop_t, NULL);
+ 			close(tmpfile);
+ 			print_elapse(start_t, stop_t);
+ 		}
+ 	}
+ 	else
+ 	{
+ 		printf("\t(unavailable: o_direct)\n");
+ 	}
  #else
  	printf("\t(unavailable: open_datasync)\n");
  #endif
  
+ /*
+  * Test open_sync if available
+  */
  #ifdef OPEN_SYNC_FLAG
  	printf(LABEL_FORMAT, "open_sync 8k write");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 135,144 ****
--- 185,226 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 
+ 	/*
+ 	 * If O_DIRECT is enabled, test that with open_sync
+ 	 */
+ 	if ( PG_O_DIRECT != 0 )
+ 	{
+ 		printf(LABEL_FORMAT, "open_sync 8k directIO write");
+ 		fflush(stdout);
+ 		if ((tmpfile = open(filename, O_RDWR | O_SYNC | PG_O_DIRECT, 0)) == -1)
+ 			printf("\t(unavailable: o_direct on this filesystem)\n");
+ 		else
+ 		{
+ 			gettimeofday(&start_t, NULL);
+ 			for (i = 0; i < loops; i++)
+ 			{
+ 				if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 					die("write failed");
+ 				if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 					die("seek failed");
+ 			}
+ 			gettimeofday(&stop_t, NULL);
+ 			close(tmpfile);
+ 			print_elapse(start_t, stop_t);
+ 		}
+ 	}
+ 	else
+ 	{
+ 		printf("\t(unavailable: o_direct)\n");
+ 	}
  #else
  	printf("\t(unavailable: open_sync)\n");
  #endif
  
+ /*
+  * Test fdatasync if available
+  */
  #ifdef HAVE_FDATASYNC
  	printf(LABEL_FORMAT, "8k write, fdatasync");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 160,165 ****
--- 242,250 ----
  	printf("\t(unavailable: fdatasync)\n");
  #endif
  
+ /*
+  * Test fsync
+  */
  	printf(LABEL_FORMAT, "8k write, fsync");
  	fflush(stdout);
  	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
*************** main(int argc, char *argv[])
*** 177,188 ****
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
  
  	/*
! 	 * Compare file sync methods with two 8k write
  	 */
  	printf("\nCompare file sync methods using two writes:\n");
  
  #ifdef OPEN_DATASYNC_FLAG
  	printf(LABEL_FORMAT, "2 open_datasync 8k writes");
  	fflush(stdout);
--- 262,304 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 	
+ /*
+  * If fsync_writethrough is available, test as well
+  * This uses 1/10 the number of loops because it tends
+  * to take forever otherwise.
+  */	
+ #ifdef HAVE_FSYNC_WRITETHROUGH
+ 	printf(LABEL_FORMAT, "8k write, fsync_writethrough");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
+ 		die("Cannot open output file.");
+ 	gettimeofday(&start_t, NULL);
+ 	for (i = 0; i < writethrough_loops; i++)
+ 	{
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (fcntl(tmpfile, F_FULLFSYNC ) != 0)
+ 			die("fsync failed");
+ 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 			die("seek failed");
+ 	}
+ 	gettimeofday(&stop_t, NULL);
+ 	close(tmpfile);
+ 	print_elapse_writethrough(start_t, stop_t);
+ #else
+ 	printf("\t(unavailable: fsync_writethrough)\n");
+ #endif
  
  	/*
! 	 * Compare some of the file sync methods with 
! 	 * two 8k writes to see if timing is different
  	 */
  	printf("\nCompare file sync methods using two writes:\n");
  
+ /*
+  * Test open_datasync with and without o_direct
+  */
  #ifdef OPEN_DATASYNC_FLAG
  	printf(LABEL_FORMAT, "2 open_datasync 8k writes");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 201,210 ****
--- 317,354 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 	
+ 	if ( PG_O_DIRECT != 0 ) 
+ 	{
+ 		printf(LABEL_FORMAT, "2 open_datasync directIO 8k writes");
+ 		fflush(stdout);
+ 		if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
+ 			die("Cannot open output file.");
+ 		gettimeofday(&start_t, NULL);
+ 		for (i = 0; i < loops; i++)
+ 		{
+ 			if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 				die("write failed");
+ 			if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 				die("write failed");
+ 			if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 				die("seek failed");
+ 		}
+ 		gettimeofday(&stop_t, NULL);
+ 		close(tmpfile);
+ 		print_elapse(start_t, stop_t);
+ 	}
+ 	else
+ 	{
+ 		printf("\t(unavailable: o_direct)\n");
+ 	}
  #else
  	printf("\t(unavailable: open_datasync)\n");
  #endif
  
+ /*
+  * Test open_sync with and without o_direct
+  */
  #ifdef OPEN_SYNC_FLAG
  	printf(LABEL_FORMAT, "2 open_sync 8k writes");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 223,230 ****
--- 367,404 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 	
+ 		if ( PG_O_DIRECT != 0 ) 
+ 	{
+ 		printf(LABEL_FORMAT, "2 open_sync directIO 8k writes");
+ 		fflush(stdout);
+ 		if ((tmpfile = open(filename, O_RDWR | O_SYNC | PG_O_DIRECT, 0)) == -1)
+ 			die("Cannot open output file.");
+ 		gettimeofday(&start_t, NULL);
+ 		for (i = 0; i < loops; i++)
+ 		{
+ 			if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 				die("write failed");
+ 			if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 				die("write failed");
+ 			if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 				die("seek failed");
+ 		}
+ 		gettimeofday(&stop_t, NULL);
+ 		close(tmpfile);
+ 		print_elapse(start_t, stop_t);
+ 	}
+ 	else
+ 	{
+ 		printf("\t(unavailable: o_direct)\n");
+ 	}
+ #else
+ 	printf("\t(unavailable: open_sync)\n");
  #endif
  
+ /*
+  *	Test fdatasync
+  */
  #ifdef HAVE_FDATASYNC
  	printf(LABEL_FORMAT, "8k write, 8k write, fdatasync");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 248,253 ****
--- 422,430 ----
  	printf("\t(unavailable: fdatasync)\n");
  #endif
  
+ /*
+  * Test basic fsync
+  */
  	printf(LABEL_FORMAT, "8k write, 8k write, fsync");
  	fflush(stdout);
  	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
*************** main(int argc, char *argv[])
*** 267,278 ****
--- 444,488 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 	
+ /*
+  * Test fsync_writethrough if available
+  * Again, using 1/10 as many loops
+  */	
+ #ifdef HAVE_FSYNC_WRITETHROUGH
+ 	printf(LABEL_FORMAT, "8k write, 8k write, fsync_writethrough");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
+ 		die("Cannot open output file.");
+ 	gettimeofday(&start_t, NULL);
+ 	for (i = 0; i < writethrough_loops; i++)
+ 	{
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (fcntl(tmpfile, F_FULLFSYNC) != 0)
+ 			die("fsync failed");
+ 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 			die("seek failed");
+ 	}
+ 	gettimeofday(&stop_t, NULL);
+ 	close(tmpfile);
+ 	print_elapse_writethrough(start_t, stop_t);
+ #else
+ 	printf("\t(unavailable: fsync_writethrough)\n");
+ #endif
  
  	/*
  	 * Compare 1 to 2 writes
  	 */
  	printf("\nCompare open_sync with different sizes:\n");
  
+ /*
+  * Test open_sync with different size files
+  * It's unclear why this is in test_fsync, since it's 
+  * not anything PostgreSQL does
+  */
  #ifdef OPEN_SYNC_FLAG
  	printf(LABEL_FORMAT, "open_sync 16k write");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 312,323 ****
  #endif
  
  	/*
! 	 * Fsync another file descriptor?
  	 */
  	printf("\nTest if fsync on non-write file descriptor is honored:\n");
  	printf("(If the times are similar, fsync() can sync data written\n");
  	printf("on a different descriptor.)\n");
  
  	printf(LABEL_FORMAT, "8k write, fsync, close");
  	fflush(stdout);
  	gettimeofday(&start_t, NULL);
--- 522,541 ----
  #endif
  
  	/*
! 	 * Test whether fsync can sync data written on a different
! 	 * descriptor for the same file.  This checks the efficiency
! 	 * of multi-process fsyncs against the same file.
! 	 * Possibly this should be done with writethrough on platforms
! 	 * which support it.
  	 */
  	printf("\nTest if fsync on non-write file descriptor is honored:\n");
  	printf("(If the times are similar, fsync() can sync data written\n");
  	printf("on a different descriptor.)\n");
  
+ 	/* 
+ 	 * first write, fsync and close, which is the 
+ 	 * normal behavior without multiple descriptors
+ 	 */
  	printf(LABEL_FORMAT, "8k write, fsync, close");
  	fflush(stdout);
  	gettimeofday(&start_t, NULL);
*************** main(int argc, char *argv[])
*** 330,343 ****
  		if (fsync(tmpfile) != 0)
  			die("fsync failed");
  		close(tmpfile);
  		if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
  			die("Cannot open output file.");
- 		/* do nothing but the open/close the tests are consistent. */
  		close(tmpfile);
  	}
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
   	printf(LABEL_FORMAT, "8k write, close, fsync");
   	fflush(stdout);
  	gettimeofday(&start_t, NULL);
--- 548,569 ----
  		if (fsync(tmpfile) != 0)
  			die("fsync failed");
  		close(tmpfile);
+ 		/*
+ 		 * open and close the file again to be consistent
+ 		 * with the following test
+ 		 */
  		if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
  			die("Cannot open output file.");
  		close(tmpfile);
  	}
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
+ 	/*
+ 	 * Now open, write, close, open again and fsync
+ 	 * This simulates processes fsyncing each other's
+ 	 * writes.
+ 	 */
   	printf(LABEL_FORMAT, "8k write, close, fsync");
   	fflush(stdout);
  	gettimeofday(&start_t, NULL);
*************** main(int argc, char *argv[])
*** 358,381 ****
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
! 	/* cleanup */
  	free(full_buf);
  	unlink(filename);
  
  	return 0;
  }
  
  void
  print_elapse(struct timeval start_t, struct timeval stop_t)
  {
  	double		total_time = (stop_t.tv_sec - start_t.tv_sec) +
- 	/* usec subtraction might be negative, e.g. 5.4 - 4.8 */
  	(stop_t.tv_usec - start_t.tv_usec) * 0.000001;
  	double		per_second = loops / total_time;
  
  	printf("%9.3f/second\n", per_second);
  }
  
  void
  die(char *str)
  {
--- 584,624 ----
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
! 	/* 
! 	 * cleanup 
! 	 */
  	free(full_buf);
  	unlink(filename);
  
  	return 0;
  }
  
+ /* 
+  * print out the writes per second for most tests
+  */
  void
  print_elapse(struct timeval start_t, struct timeval stop_t)
  {
  	double		total_time = (stop_t.tv_sec - start_t.tv_sec) +
  	(stop_t.tv_usec - start_t.tv_usec) * 0.000001;
  	double		per_second = loops / total_time;
  
  	printf("%9.3f/second\n", per_second);
  }
  
+ /* 
+  * print out the writes per second for writethrough tests
+  */
+ void
+ print_elapse_writethrough(struct timeval start_t, struct timeval stop_t)
+ {
+ 	double		total_time = (stop_t.tv_sec - start_t.tv_sec) +
+ 	(stop_t.tv_usec - start_t.tv_usec) * 0.000001;
+ 	double		per_second = writethrough_loops / total_time;
+ 
+ 	printf("%9.3f/second\n", per_second);
+ }
+ 
  void
  die(char *str)
  {

#26

Bruce Momjian

bruce@momjian.us

about 15 years ago

In reply to: Josh Berkus (#25)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Josh Berkus wrote:

On 12/6/10 6:13 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

OK, patch coming then. Right now test_fsync aborts when O_DIRECT fails.
What should I have it do instead?

Report that it fails, and keep testing the other methods.

Patch attached. Includes a fair amount of comment cleanup, since
existing comments did not meet our current project standards. Tests all
6 of the methods we support separately.

Some questions, though:

(1) Why are we doing the open_sync different-size write test? AFAIK,
this doesn't match any behavior which PostgreSQL has.

I did that so we could see the impact of doing 2 8k writes that were
both fsync'ed vs doing one 16k write and then fsync:

Compare open_sync with different sizes:
open_sync 16k write 201.323/second
2 open_sync 8k writes 332.466/second

We often write multiple 8k WAL pages and then fsync on commit.

(2) In this patch, I'm stepping down the number of loops which
fsync_writethrough does by 90%. The reason for that was that on the
platforms where I tested writethrough (desktop machines), doing 10,000
loops took 15-20 *minutes*, which seems hard on the user. Would be easy
to revert if you think it's a bad idea.
Possibly auto-sizing the number of loops based on the first fsync test
might be a good idea, but seems like going a bit too far.

Sure, I recently increased the number, probably too much.

(3) Should the multi-descriptor test be using writethrough on platforms
which support it?

Uh, I didn't think that would matter because the test is to test kernel
behavior of writing to one file descriptor and fsyncing using another.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#27

Bruce Momjian

bruce@momjian.us

about 15 years ago

In reply to: Tom Lane (#21)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

Making it support O_DIRECT would be possible but more complex; I don't
see the point unless we think we're going to have open_sync_with_odirect
as a seperate option.

Whether it's complex or not isn't really the issue. The issue is that
what test_fsync is testing had better match what the backend does, or
people will be making choices based on not-comparable test results.
I think we should have test_fsync just automatically fold in O_DIRECT
the same way the backend does.

The problem is that O_DIRECT was not implemented in macros but rather
down in the code:

if (!XLogIsNeeded() && !am_walreceiver)
o_direct_flag = PG_O_DIRECT;

Which means if we just export the macros, we would still not have caught
this. I would like to share all the defines --- I am just saying it
isn't trivial.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#28

Josh Berkus

josh@agliodbs.com

about 15 years ago

In reply to: Bruce Momjian (#27)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Which means if we just export the macros, we would still not have caught
this. I would like to share all the defines --- I am just saying it
isn't trivial.

I just called all the define variables manually rather than relying on
the macros. Seemed to work fine.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#29

Josh Berkus

josh@agliodbs.com

about 15 years ago

In reply to: Josh Berkus (#28)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Greg, All:

Results for Solaris 10u8, on ZFS on a 7-drive attached storage array:

bash-3.00# ./test_fsync -f /dbdata/pgdata/test.out
Loops = 10000

Simple write:
8k write 59988.002/second

Compare file sync methods using one write:
open_datasync 8k write 214.125/second
(unavailable: o_direct)
open_sync 8k write 222.155/second
(unavailable: o_direct)
8k write, fdatasync 214.086/second
8k write, fsync 215.035/second
(unavailable: fsync_writethrough)

Compare file sync methods using two writes:
2 open_datasync 8k writes 108.227/second
(unavailable: o_direct)
2 open_sync 8k writes 106.935/second
(unavailable: o_direct)
8k write, 8k write, fdatasync 205.525/second
8k write, 8k write, fsync 210.483/second
(unavailable: fsync_writethrough)

Compare open_sync with different sizes:
open_sync 16k write 211.481/second
2 open_sync 8k writes 106.202/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 207.499/second
8k write, close, fsync 213.656/second

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#30

Bruce Momjian

bruce@momjian.us

almost 15 years ago

In reply to: Josh Berkus (#25)

1 attachment(s)

Re: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Josh Berkus wrote:

On 12/6/10 6:13 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

OK, patch coming then. Right now test_fsync aborts when O_DIRECT fails.
What should I have it do instead?

Report that it fails, and keep testing the other methods.

Patch attached. Includes a fair amount of comment cleanup, since
existing comments did not meet our current project standards. Tests all
6 of the methods we support separately.

Some questions, though:

(1) Why are we doing the open_sync different-size write test? AFAIK,
this doesn't match any behavior which PostgreSQL has.

I added program output to explain this.

(2) In this patch, I'm stepping down the number of loops which
fsync_writethrough does by 90%. The reason for that was that on the
platforms where I tested writethrough (desktop machines), doing 10,000
loops took 15-20 *minutes*, which seems hard on the user. Would be easy
to revert if you think it's a bad idea.
Possibly auto-sizing the number of loops based on the first fsync test
might be a good idea, but seems like going a bit too far.

I did not know why writethough we always be much slower than other sync
methods so I just reduced the loop count to 2k.

(3) Should the multi-descriptor test be using writethrough on platforms
which support it?

Thank you for your patch. I have applied most of it, attached.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Attachments:

/rtmp/fsynctext/x-diffDownload

diff --git a/src/tools/fsync/Makefile b/src/tools/fsync/Makefile
index fe3e626..44419ee 100644
*** /tmp/pgdiff.17122/M3047c_Makefile	Sat Jan 15 11:53:43 2011
--- src/tools/fsync/Makefile	Fri Jan 14 22:35:30 2011
*************** override CPPFLAGS := -I$(libpq_srcdir) $
*** 16,24 ****
  
  OBJS= test_fsync.o
  
! all: test_fsync
  
! test_fsync: test_fsync.o | submake-libpq submake-libpgport
  	$(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
  
  clean distclean maintainer-clean:
--- 16,24 ----
  
  OBJS= test_fsync.o
  
! all: submake-libpq submake-libpgport test_fsync
  
! test_fsync: test_fsync.o $(libpq_builddir)/libpq.a
  	$(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
  
  clean distclean maintainer-clean:
diff --git a/src/tools/fsync/README b/src/tools/fsync/README
index 6d9acd3..a1c2ae4 100644
*** /tmp/pgdiff.17122/wFeqrc_README	Sat Jan 15 11:53:43 2011
--- src/tools/fsync/README	Sat Jan 15 11:34:58 2011
***************
*** 1,11 ****
! src/tools/fsync/README
! 
! fsync
! =====
  
  This program tests fsync.  The tests are described as part of the program output.
  
  	Usage:	test_fsync [-f filename] [loops]
  
- Loops defaults to 5000.  The default output file is /var/tmp/test_fsync.out.
- Consider that /tmp or /var/tmp might be memory-based file systems.
--- 1,20 ----
! test_fsync
! ==========
  
  This program tests fsync.  The tests are described as part of the program output.
  
  	Usage:	test_fsync [-f filename] [loops]
+ 	
+ test_fsync is intended to give you a reasonable idea of what the fastest
+ fsync_method is on your specific system, as well as supplying diagnostic
+ information in the event of an identified I/O problem.  However,
+ differences shown by test_fsync might not make any difference in real
+ database throughput, especially since many database servers are not
+ speed-limited by their transaction logs.
+ 
+ The output filename defaults to test_fsync.out in the current directory.
+ test_fsync should be run in the same filesystem as your transaction log
+ directory (pg_xlog).
+ 
+ Loops default to 2000.  Increase this to get more accurate measurements.
  
diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index 28c2119..831e2a0 100644
*** /tmp/pgdiff.17122/OSeljb_test_fsync.c	Sat Jan 15 11:53:43 2011
--- src/tools/fsync/test_fsync.c	Sat Jan 15 11:52:03 2011
***************
*** 3,9 ****
   *
   *
   *	test_fsync.c
!  *		test various fsync() methods
   */
  
  #include "postgres.h"
--- 3,9 ----
   *
   *
   *	test_fsync.c
!  *		tests all supported fsync() methods
   */
  
  #include "postgres.h"
***************
*** 22,40 ****
  #include <unistd.h>
  #include <string.h>
  
! 
! #ifdef WIN32
  #define FSYNC_FILENAME	"./test_fsync.out"
- #else
- /* /tmp might be a memory file system */
- #define FSYNC_FILENAME	"/var/tmp/test_fsync.out"
- #endif
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */
  
  #define LABEL_FORMAT	"\t%-30s"
  
! int			loops = 10000;
  
  void		die(char *str);
  void		print_elapse(struct timeval start_t, struct timeval stop_t);
--- 22,39 ----
  #include <unistd.h>
  #include <string.h>
  
! /* 
!  * put the temp files in the local directory
!  * unless the user specifies otherwise 
!  */
  #define FSYNC_FILENAME	"./test_fsync.out"
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */
  
  #define LABEL_FORMAT	"\t%-30s"
  
! 
! int			loops = 2000;
  
  void		die(char *str);
  void		print_elapse(struct timeval start_t, struct timeval stop_t);
*************** void		print_elapse(struct timeval start_
*** 42,55 ****
  int
  main(int argc, char *argv[])
  {
! 	struct timeval start_t;
! 	struct timeval stop_t;
! 	int			tmpfile,
! 				i;
  	char	   *full_buf = (char *) malloc(XLOG_SEG_SIZE),
! 			   *buf;
! 	char	   *filename = FSYNC_FILENAME;
  
  	if (argc > 2 && strcmp(argv[1], "-f") == 0)
  	{
  		filename = argv[2];
--- 41,54 ----
  int
  main(int argc, char *argv[])
  {
! 	struct timeval start_t, stop_t;
! 	int			tmpfile, i;
  	char	   *full_buf = (char *) malloc(XLOG_SEG_SIZE),
! 			   *buf, *filename = FSYNC_FILENAME;
  
+ 	/* 
+ 	 * arguments: loops and filename (optional) 
+ 	 */
  	if (argc > 2 && strcmp(argv[1], "-f") == 0)
  	{
  		filename = argv[2];
*************** main(int argc, char *argv[])
*** 57,73 ****
  		argc -= 2;
  	}
  
! 	if (argc > 1)
  		loops = atoi(argv[1]);
  
  	for (i = 0; i < XLOG_SEG_SIZE; i++)
  		full_buf[i] = random();
  
  	if ((tmpfile = open(filename, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) == -1)
  		die("Cannot open output file.");
  	if (write(tmpfile, full_buf, XLOG_SEG_SIZE) != XLOG_SEG_SIZE)
  		die("write failed");
! 	/* fsync now so later fsync's don't have to do it */
  	if (fsync(tmpfile) != 0)
  		die("fsync failed");
  	close(tmpfile);
--- 56,77 ----
  		argc -= 2;
  	}
  
! 	if (argc > 1) 
  		loops = atoi(argv[1]);
  
  	for (i = 0; i < XLOG_SEG_SIZE; i++)
  		full_buf[i] = random();
  
+ 	/* 
+ 	 * test if we can open the target file 
+ 	 */
  	if ((tmpfile = open(filename, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) == -1)
  		die("Cannot open output file.");
  	if (write(tmpfile, full_buf, XLOG_SEG_SIZE) != XLOG_SEG_SIZE)
  		die("write failed");
! 	/*
! 	 * fsync now so that dirty buffers don't skew later tests
! 	 */
  	if (fsync(tmpfile) != 0)
  		die("fsync failed");
  	close(tmpfile);
*************** main(int argc, char *argv[])
*** 77,83 ****
  	printf("Loops = %d\n\n", loops);
  
  	/*
! 	 * Simple write
  	 */
  	printf("Simple write:\n");
  	printf(LABEL_FORMAT, "8k write");
--- 81,87 ----
  	printf("Loops = %d\n\n", loops);
  
  	/*
! 	 * Test a simple write without fsync
  	 */
  	printf("Simple write:\n");
  	printf(LABEL_FORMAT, "8k write");
*************** main(int argc, char *argv[])
*** 95,104 ****
  	print_elapse(start_t, stop_t);
  
  	/*
! 	 * Compare file sync methods with one 8k write
  	 */
  	printf("\nCompare file sync methods using one write:\n");
  
  #ifdef OPEN_DATASYNC_FLAG
  	printf(LABEL_FORMAT, "open_datasync 8k write");
  	fflush(stdout);
--- 99,111 ----
  	print_elapse(start_t, stop_t);
  
  	/*
! 	 * Test all fsync methods using single 8k writes
  	 */
  	printf("\nCompare file sync methods using one write:\n");
  
+ 	/*
+ 	 * Test open_datasync if available
+ 	 */
  #ifdef OPEN_DATASYNC_FLAG
  	printf(LABEL_FORMAT, "open_datasync 8k write");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 115,124 ****
--- 122,161 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 
+ 	/*
+ 	 * If O_DIRECT is enabled, test that with open_datasync
+ 	 */
+ #if PG_O_DIRECT != 0
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
+ 		printf("\t(unavailable: o_direct on this filesystem)\n");
+ 	else
+ 	{
+ 		printf(LABEL_FORMAT, "open_datasync 8k direct I/O write");
+ 		gettimeofday(&start_t, NULL);
+ 		for (i = 0; i < loops; i++)
+ 		{
+ 			if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 				die("write failed");
+ 			if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 				die("seek failed");
+ 		}
+ 		gettimeofday(&stop_t, NULL);
+ 		close(tmpfile);
+ 		print_elapse(start_t, stop_t);
+ 	}
+ #else
+ 		printf("\t(unavailable: o_direct)\n");
+ #endif
+ 
  #else
  	printf("\t(unavailable: open_datasync)\n");
  #endif
  
+ /*
+  * Test open_sync if available
+  */
  #ifdef OPEN_SYNC_FLAG
  	printf(LABEL_FORMAT, "open_sync 8k write");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 135,144 ****
--- 172,211 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 
+ 	/*
+ 	 * If O_DIRECT is enabled, test that with open_sync
+ 	 */
+ #if PG_O_DIRECT != 0
+ 	printf(LABEL_FORMAT, "open_sync 8k direct I/O write");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_O_DIRECT, 0)) == -1)
+ 		printf("\t(unavailable: o_direct on this filesystem)\n");
+ 	else
+ 	{
+ 		gettimeofday(&start_t, NULL);
+ 		for (i = 0; i < loops; i++)
+ 		{
+ 			if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 				die("write failed");
+ 			if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 				die("seek failed");
+ 		}
+ 		gettimeofday(&stop_t, NULL);
+ 		close(tmpfile);
+ 		print_elapse(start_t, stop_t);
+ 	}
+ #else
+ 	printf("\t(unavailable: o_direct)\n");
+ #endif
+ 
  #else
  	printf("\t(unavailable: open_sync)\n");
  #endif
  
+ /*
+  * Test fdatasync if available
+  */
  #ifdef HAVE_FDATASYNC
  	printf(LABEL_FORMAT, "8k write, fdatasync");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 160,165 ****
--- 227,235 ----
  	printf("\t(unavailable: fdatasync)\n");
  #endif
  
+ /*
+  * Test fsync
+  */
  	printf(LABEL_FORMAT, "8k write, fsync");
  	fflush(stdout);
  	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
*************** main(int argc, char *argv[])
*** 177,190 ****
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
  
  	/*
! 	 * Compare file sync methods with two 8k write
  	 */
  	printf("\nCompare file sync methods using two writes:\n");
  
  #ifdef OPEN_DATASYNC_FLAG
! 	printf(LABEL_FORMAT, "2 open_datasync 8k writes");
  	fflush(stdout);
  	if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
  		die("Cannot open output file.");
--- 247,289 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 	
+ /*
+  * If fsync_writethrough is available, test as well
+  */	
+ #ifdef HAVE_FSYNC_WRITETHROUGH
+ 	printf(LABEL_FORMAT, "8k write, fsync_writethrough");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
+ 		die("Cannot open output file.");
+ 	gettimeofday(&start_t, NULL);
+ 	for (i = 0; i < loops; i++)
+ 	{
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (fcntl(tmpfile, F_FULLFSYNC ) != 0)
+ 			die("fsync failed");
+ 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 			die("seek failed");
+ 	}
+ 	gettimeofday(&stop_t, NULL);
+ 	close(tmpfile);
+ 	print_elapse(start_t, stop_t);
+ #else
+ 	printf("\t(unavailable: fsync_writethrough)\n");
+ #endif
  
  	/*
! 	 * Compare some of the file sync methods with 
! 	 * two 8k writes to see if timing is different
  	 */
  	printf("\nCompare file sync methods using two writes:\n");
  
+ /*
+  * Test open_datasync with and without o_direct
+  */
  #ifdef OPEN_DATASYNC_FLAG
!  	printf(LABEL_FORMAT, "2 open_datasync 8k writes");
  	fflush(stdout);
  	if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
  		die("Cannot open output file.");
*************** main(int argc, char *argv[])
*** 201,210 ****
--- 300,335 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 	
+ #if PG_O_DIRECT != 0
+ 	printf(LABEL_FORMAT, "2 open_datasync direct I/O 8k writes");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
+ 		die("Cannot open output file.");
+ 	gettimeofday(&start_t, NULL);
+ 	for (i = 0; i < loops; i++)
+ 	{
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 			die("seek failed");
+ 	}
+ 	gettimeofday(&stop_t, NULL);
+ 	close(tmpfile);
+ 	print_elapse(start_t, stop_t);
+ #else
+ 		printf("\t(unavailable: o_direct)\n");
+ #endif
+ 
  #else
  	printf("\t(unavailable: open_datasync)\n");
  #endif
  
+ /*
+  * Test open_sync with and without o_direct
+  */
  #ifdef OPEN_SYNC_FLAG
  	printf(LABEL_FORMAT, "2 open_sync 8k writes");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 223,230 ****
--- 348,383 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 	
+ #if PG_O_DIRECT != 0
+ 	printf(LABEL_FORMAT, "2 open_sync direct I/O 8k writes");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_O_DIRECT, 0)) == -1)
+ 		die("Cannot open output file.");
+ 	gettimeofday(&start_t, NULL);
+ 	for (i = 0; i < loops; i++)
+ 	{
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 			die("seek failed");
+ 	}
+ 	gettimeofday(&stop_t, NULL);
+ 	close(tmpfile);
+ 	print_elapse(start_t, stop_t);
+ #else
+ 	printf("\t(unavailable: o_direct)\n");
+ #endif
+ 
+ #else
+ 	printf("\t(unavailable: open_sync)\n");
  #endif
  
+ /*
+  *	Test fdatasync
+  */
  #ifdef HAVE_FDATASYNC
  	printf(LABEL_FORMAT, "8k write, 8k write, fdatasync");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 248,253 ****
--- 401,409 ----
  	printf("\t(unavailable: fdatasync)\n");
  #endif
  
+ /*
+  * Test basic fsync
+  */
  	printf(LABEL_FORMAT, "8k write, 8k write, fsync");
  	fflush(stdout);
  	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
*************** main(int argc, char *argv[])
*** 267,278 ****
--- 423,466 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 	
+ /*
+  * Test fsync_writethrough if available
+  */	
+ #ifdef HAVE_FSYNC_WRITETHROUGH
+ 	printf(LABEL_FORMAT, "8k write, 8k write, fsync_writethrough");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
+ 		die("Cannot open output file.");
+ 	gettimeofday(&start_t, NULL);
+ 	for (i = 0; i < loops; i++)
+ 	{
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (fcntl(tmpfile, F_FULLFSYNC) != 0)
+ 			die("fsync failed");
+ 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 			die("seek failed");
+ 	}
+ 	gettimeofday(&stop_t, NULL);
+ 	close(tmpfile);
+ 	print_elapse(start_t, stop_t);
+ #else
+ 	printf("\t(unavailable: fsync_writethrough)\n");
+ #endif
  
  	/*
  	 * Compare 1 to 2 writes
  	 */
  	printf("\nCompare open_sync with different sizes:\n");
+ 	printf("(This is designed to compare the cost of one large\n");
+ 	printf("sync'ed write and two smaller sync'ed writes.)\n");
  
+ /*
+  * Test open_sync with different size files
+  */
  #ifdef OPEN_SYNC_FLAG
  	printf(LABEL_FORMAT, "open_sync 16k write");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 312,323 ****
  #endif
  
  	/*
! 	 * Fsync another file descriptor?
  	 */
  	printf("\nTest if fsync on non-write file descriptor is honored:\n");
  	printf("(If the times are similar, fsync() can sync data written\n");
  	printf("on a different descriptor.)\n");
  
  	printf(LABEL_FORMAT, "8k write, fsync, close");
  	fflush(stdout);
  	gettimeofday(&start_t, NULL);
--- 500,519 ----
  #endif
  
  	/*
! 	 * Test whether fsync can sync data written on a different
! 	 * descriptor for the same file.  This checks the efficiency
! 	 * of multi-process fsyncs against the same file.
! 	 * Possibly this should be done with writethrough on platforms
! 	 * which support it.
  	 */
  	printf("\nTest if fsync on non-write file descriptor is honored:\n");
  	printf("(If the times are similar, fsync() can sync data written\n");
  	printf("on a different descriptor.)\n");
  
+ 	/* 
+ 	 * first write, fsync and close, which is the 
+ 	 * normal behavior without multiple descriptors
+ 	 */
  	printf(LABEL_FORMAT, "8k write, fsync, close");
  	fflush(stdout);
  	gettimeofday(&start_t, NULL);
*************** main(int argc, char *argv[])
*** 330,343 ****
  		if (fsync(tmpfile) != 0)
  			die("fsync failed");
  		close(tmpfile);
  		if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
  			die("Cannot open output file.");
- 		/* do nothing but the open/close the tests are consistent. */
  		close(tmpfile);
  	}
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
   	printf(LABEL_FORMAT, "8k write, close, fsync");
   	fflush(stdout);
  	gettimeofday(&start_t, NULL);
--- 526,547 ----
  		if (fsync(tmpfile) != 0)
  			die("fsync failed");
  		close(tmpfile);
+ 		/*
+ 		 * open and close the file again to be consistent
+ 		 * with the following test
+ 		 */
  		if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
  			die("Cannot open output file.");
  		close(tmpfile);
  	}
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
+ 	/*
+ 	 * Now open, write, close, open again and fsync
+ 	 * This simulates processes fsyncing each other's
+ 	 * writes.
+ 	 */
   	printf(LABEL_FORMAT, "8k write, close, fsync");
   	fflush(stdout);
  	gettimeofday(&start_t, NULL);
*************** main(int argc, char *argv[])
*** 358,375 ****
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
! 	/* cleanup */
  	free(full_buf);
  	unlink(filename);
  
  	return 0;
  }
  
  void
  print_elapse(struct timeval start_t, struct timeval stop_t)
  {
  	double		total_time = (stop_t.tv_sec - start_t.tv_sec) +
- 	/* usec subtraction might be negative, e.g. 5.4 - 4.8 */
  	(stop_t.tv_usec - start_t.tv_usec) * 0.000001;
  	double		per_second = loops / total_time;
  
--- 562,583 ----
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
! 	/* 
! 	 * cleanup 
! 	 */
  	free(full_buf);
  	unlink(filename);
  
  	return 0;
  }
  
+ /* 
+  * print out the writes per second for tests
+  */
  void
  print_elapse(struct timeval start_t, struct timeval stop_t)
  {
  	double		total_time = (stop_t.tv_sec - start_t.tv_sec) +
  	(stop_t.tv_usec - start_t.tv_usec) * 0.000001;
  	double		per_second = loops / total_time;