[PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+
Hi list,
PostgreSQL's default settings change when built with Linux kernel
headers 2.6.33 or newer. As discussed on the pgsql-performance list,
this causes a significant performance regression:
http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php
NB! I am not proposing to change the default -- to the contrary --
this patch restores old behavior. Users might be in for a nasty
performance surprise when re-building their Postgres with newer Linux
headers (as was I), so I propose that this change should be made in
all supported releases.
-- commit message --
Revert default wal_sync_method to fdatasync on Linux 2.6.33+
Linux kernel headers from 2.6.33 (and later) change the behavior of the
O_SYNC flag. Previously O_SYNC was aliased to O_DSYNC, which caused
PostgreSQL to use fdatasync as the default instead.
Starting with kernels 2.6.33 and later, the definitions of O_DSYNC and
O_SYNC differ. When built with headers from these newer kernels,
PostgreSQL will default to using open_datasync. This patch reverts the
Linux default to fdatasync, which has had much more testing over time
and also significantly better performance.
-- end commit message --
Earlier kernel headers defined both O_SYNC and O_DSYNC as 0x1000.
Kernels 2.6.33 and later define O_SYNC=0x101000 and O_DSYNC=0x1000
(since the old behavior on most filesystems was always equivalent to
POSIX O_DSYNC).
More details at:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b2f3d1f769be5779b479c37800229d9a4809fc3
Currently PostgreSQL's include/access/xlogdefs.h defaults to using
open_datasync when O_SYNC != O_DSYNC, otherwise fdatasync is used.
Since other platforms might want to default to fdatasync in the
future, too, I defined a new PLATFORM_DEFAULT_SYNC_METHOD constant in
include/port/linux.h. I don't know if this is the best way to do it.
Regards,
Marti
Attachments:
0001-Revert-default-wal_sync_method-to-fdatasync-on-Linux.patch (text/x-patch, +14/-1)
Marti Raudsepp <marti@juffo.org> writes:
PostgreSQL's default settings change when built with Linux kernel
headers 2.6.33 or newer. As discussed on the pgsql-performance list,
this causes a significant performance regression:
http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php
NB! I am not proposing to change the default -- to the contrary --
this patch restores old behavior.
I'm less than convinced this is the right approach ...
If open_datasync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?
regards, tom lane
On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm less than convinced this is the right approach ...
If open_datasync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?
Sure, maybe for PostgreSQL 9.1
But the immediate problem is older releases (8.1 - 9.0) specifically
on Linux. Something as innocuous as re-building your database on a
newer kernel will radically affect performance -- even when the
database software itself didn't change.
So I think we should aim to fix old versions first. Do you disagree?
Regards,
Marti
Marti Raudsepp <marti@juffo.org> writes:
On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If open_datasync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?
So I think we should aim to fix old versions first. Do you disagree?
What's that got to do with it?
regards, tom lane
On Fri, Nov 5, 2010 at 21:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Marti Raudsepp <marti@juffo.org> writes:
On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If open_datasync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?
So I think we should aim to fix old versions first. Do you disagree?
What's that got to do with it?
I'm not sure what you're asking.
Surely changing the default wal_sync_method for all OSes in
maintenance releases is out of the question, no?
Regards,
Marti
Marti Raudsepp <marti@juffo.org> writes:
On Fri, Nov 5, 2010 at 21:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
What's that got to do with it?
I'm not sure what you're asking.
Surely changing the default wal_sync_method for all OSes in
maintenance releases is out of the question, no?
Well, if we could leave well enough alone it would be fine with me,
but I think our hand is being forced by the Linux kernel hackers.
I don't really think that "change the default on Linux" is that
much nicer than "change the default everywhere" when it comes to
what we ought to consider back-patching. In any case, you're getting
ahead of the game: we need to decide on the desired behavior first and
then think about what to patch. Do the performance results that were
cited show that open_datasync is generally inferior to fdatasync? If so,
why would we think that that conclusion is Linux-specific?
regards, tom lane
On Friday 05 November 2010 19:13:47 Tom Lane wrote:
Marti Raudsepp <marti@juffo.org> writes:
PostgreSQL's default settings change when built with Linux kernel
headers 2.6.33 or newer. As discussed on the pgsql-performance list,
this causes a significant performance regression:
http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php
NB! I am not proposing to change the default -- to the contrary --
this patch restores old behavior.
I'm less than convinced this is the right approach ...
If open_datasync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?
I fail to see how it could be beneficial on *any* non-buggy platform.
Especially with small wal_buffers and larger commits (but also otherwise) it
tremendously increases the number of synchronous writes the OS has to do.
* It removes nearly all benefits of XLogBackgroundFlush()
* It removes any chance of reordering after writing.
* It makes AdvanceXLInsertBuffer synchronous if it has to write out.
What's the theory behind placing it so high in the preference list?
Andres
On Fri, Nov 5, 2010 at 22:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't really think that "change the default on Linux" is that
much nicer than "change the default everywhere" when it comes to
what we ought to consider back-patching. In any case, you're getting
ahead of the game: we need to decide on the desired behavior first and
then think about what to patch.
We should be trying to guarantee the stability of maintenance
releases. "Stability" includes consistent defaults. The fact that
Linux now distinguishes between these two flags has a very surprising
effect on PostgreSQL's defaults; an effect that wasn't intended by any
developer, is not documented anywhere, and certainly won't be
anticipated by users.
Do you reject this premise?
As newer distros adopt 2.6.33+ kernels, more and more people
will shoot themselves in the foot because of this change. I am also worried
that it will have a direct effect on PostgreSQL adoption.
Regards,
Marti
Tom Lane wrote:
If open_datasync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?
And I've seen the expected sync write performance gain over fdatasync on
a system with a battery-backed cache running VxFS on Linux, because
working open_[d]sync means O_DIRECT writes bypassing the OS cache, and
therefore reducing cache pollution from WAL writes. This doesn't work
by default on Solaris because they have a special system call you have
to execute for direct output, but if you trick the OS into doing that
via mount options you can observe it there too. The last serious tests
of this area I saw on that platform were from Jignesh, and they
certainly didn't show a significant performance regression running in
sync mode. I vaguely recall seeing a set once that showed a minor loss
compared to fdatasync, but it was too close to make any definitive
statement about reordering.
I haven't seen any report yet of a serious performance regression in the
new Linux case that was written by someone who understands fully how
fsync and drive cache flushing are supposed to interact. It's been
obvious for a year now that the authors of the Phoronix reports about
this had no idea what they were actually testing. I didn't see anything from
Marti's report that definitively answers whether this is anything other
than Linux finally doing the right thing to flush drive caches out when
sync writes happen. There may be a performance regression here related
to WAL data going out in smaller chunks than it used to, but in all the
reports I've seen that hasn't been isolated well enough to tell whether
it's a performance loss or a reliability gain we're seeing, so I don't
think we should make any changes yet.
I'd like to see some output from the 9.0 test_fsync on one of these
RHEL6 systems without a battery-backed write cache as a first step
here. That should start to shed some light on what's
happening. I just bumped up the priority on the pending upgrade of my
spare laptop to the RHEL6 beta I had been trying to find time for, so I
can investigate this further myself.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
Andres Freund <andres@anarazel.de> writes:
On Friday 05 November 2010 19:13:47 Tom Lane wrote:
If open_datasync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?
I fail to see how it could be beneficial on *any* non-buggy platform.
Especially with small wal_buffers and larger commits (but also otherwise) it
increases the amount of synchronous writes the os has to do tremendously.
* It removes nearly all benefits of XLogBackgroundFlush()
* It removes any chance of reordering after writing.
* It makes AdvanceXLInsertBuffer synchronous if it has to write out.
What's the theory behind placing it so high in the preference list?
I think the original idea was that if you had a dedicated WAL drive then
sync-on-write would be reasonable. But that was a very long time ago
and I'm not sure that the system's behavior is anything like what it was
then; for that matter I'm not sure we had proof that it was an optimal
choice even back then. That's why I want to revisit the choice of
default and not just go for "minimum" change.
regards, tom lane
On Friday 05 November 2010 22:53:37 Greg Smith wrote:
If open_datasync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?
And I've seen the expected sync write performance gain over fdatasync on
a system with a battery-backed cache running VxFS on Linux, because
working open_[d]sync means O_DIRECT writes bypassing the OS cache, and
therefore reducing cache pollution from WAL writes.
Which looks like a setup where you definitely need to know what you're
doing -- i.e. one that doesn't need wal_sync_method to default to
open_datasync...
Andres
I think the original idea was that if you had a dedicated WAL drive then
sync-on-write would be reasonable. But that was a very long time ago
and I'm not sure that the system's behavior is anything like what it was
then; for that matter I'm not sure we had proof that it was an optimal
choice even back then. That's why I want to revisit the choice of
default and not just go for "minimum" change.
What platforms do we need to test to get a reasonable idea? Solaris,
FreeBSD, Windows?
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
What platforms do we need to test to get a reasonable idea? Solaris,
FreeBSD, Windows?
At least. I'm hoping that Greg Smith will take the lead on testing
this, since he seems to have spent the most time in the area so far.
regards, tom lane
On 11/5/10 3:31 PM, Tom Lane wrote:
Josh Berkus <josh@agliodbs.com> writes:
What platforms do we need to test to get a reasonable idea? Solaris,
FreeBSD, Windows?
At least. I'm hoping that Greg Smith will take the lead on testing
this, since he seems to have spent the most time in the area so far.
I could test at least 1 version of Solaris, I think.
Greg, any recommendations on pgbench parameters?
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Tom Lane wrote:
I'm hoping that Greg Smith will take the lead on testing
this, since he seems to have spent the most time in the area so far.
It's not coincidence that the chapter of my book I convinced the
publisher to release as a sample is the one that covers this area; this
mess has been visibly approaching for some time now. I'm going to put
RHEL6 onto a system and start collecting some proper slowdown numbers
this week, then pass along a suggested test regime for others.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
Marti Raudsepp wrote:
PostgreSQL's default settings change when built with Linux kernel
headers 2.6.33 or newer. As discussed on the pgsql-performance list,
this causes a significant performance regression:
http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php
NB! I am not proposing to change the default -- to the contrary --
this patch restores old behavior.
Following our standard community development model, I've put this patch
onto our CommitFest list:
https://commitfest.postgresql.org/action/patch_view?id=432 and assigned
myself as the reviewer. I didn't look at this until now because I
already had some patch development and review work to finish before the
CommitFest deadline we just crossed. Now I can go back to reviewing
other people's work.
P.S. There is no pgsql-patch list anymore; everything goes through the
hackers list now.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
All,
So, this week I've had my hands on a medium-high-end test system where I
could test various wal_sync_methods. This is a 24-core Intel Xeon
machine with 72GB of ram, and 8 internal 10K SAS disks attached to a
raid controller with 512MB BBU write cache. 2 of the disks are in a
RAID1, which supports both an Ext4 partition and an XFS partition. The
remaining disks are in a RAID10 which only supports a single pgdata
partition.
This is running on RHEL6, Linux Kernel: 2.6.32-71.el6.x86_64
I think this kind of system represents our performance-conscious users
much better than testing on people's laptops or VMs does.
I modified test_fsync in two ways to run this; first, to make it support
O_DIRECT, and second to make it run in the *current* directory. I think
the second change should be permanent; I imagine that a lot of people
who are running test_fsync are not aware that they're actually testing
the performance of /var/tmp, not whatever FS mount they wanted to test.
Here are the results. I think you'll agree that, at least on Linux, the
benefits of open_sync and open_datasync as defaults would be highly
questionable. In particular, it seems that if O_DIRECT support is
absent, fdatasync is faster across the board:
=============
test_fsync with directIO, on 2 drives, XFS tuned:
Loops = 10000
Simple write:
8k write 198629.457/second
Compare file sync methods using one write:
open_datasync 8k write 14798.263/second
open_sync 8k write 14316.864/second
8k write, fdatasync 12198.871/second
8k write, fsync 12371.843/second
Compare file sync methods using two writes:
2 open_datasync 8k writes 7362.805/second
2 open_sync 8k writes 7156.685/second
8k write, 8k write, fdatasync 10613.525/second
8k write, 8k write, fsync 10597.396/second
Compare open_sync with different sizes:
open_sync 16k write 13631.816/second
2 open_sync 8k writes 7645.038/second
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 11427.096/second
8k write, close, fsync 11321.220/second
test_fsync with directIO, on 6 drives RAID10, XFS tuned:
Loops = 10000
Simple write:
8k write 196494.537/second
Compare file sync methods using one write:
open_datasync 8k write 14909.974/second
open_sync 8k write 14559.326/second
8k write, fdatasync 11046.025/second
8k write, fsync 11046.916/second
Compare file sync methods using two writes:
2 open_datasync 8k writes 7349.223/second
2 open_sync 8k writes 7667.395/second
8k write, 8k write, fdatasync 9560.495/second
8k write, 8k write, fsync 9557.287/second
Compare open_sync with different sizes:
open_sync 16k write 12060.049/second
2 open_sync 8k writes 7650.746/second
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 9377.107/second
8k write, close, fsync 9251.233/second
test_fsync without directIO on RAID1, Ext4, data=journal:
Loops = 10000
Simple write:
8k write 150514.005/second
Compare file sync methods using one write:
open_datasync 8k write 4012.070/second
open_sync 8k write 5476.898/second
8k write, fdatasync 5512.649/second
8k write, fsync 5803.814/second
Compare file sync methods using two writes:
2 open_datasync 8k writes 2910.401/second
2 open_sync 8k writes 2817.377/second
8k write, 8k write, fdatasync 5041.608/second
8k write, 8k write, fsync 5155.248/second
Compare open_sync with different sizes:
open_sync 16k write 4895.956/second
2 open_sync 8k writes 2720.875/second
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 4724.052/second
8k write, close, fsync 4694.776/second
test_fsync without directIO on RAID1, XFS, tuned:
Loops = 10000
Simple write:
8k write 199796.208/second
Compare file sync methods using one write:
open_datasync 8k write 12553.525/second
open_sync 8k write 12535.978/second
8k write, fdatasync 12268.298/second
8k write, fsync 12305.875/second
Compare file sync methods using two writes:
2 open_datasync 8k writes 6323.835/second
2 open_sync 8k writes 6285.169/second
8k write, 8k write, fdatasync 10893.756/second
8k write, 8k write, fsync 10752.607/second
Compare open_sync with different sizes:
open_sync 16k write 11053.510/second
2 open_sync 8k writes 6293.270/second
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 11087.482/second
8k write, close, fsync 11157.477/second
test_fsync without directIO on RAID10, 6 drives, XFS Tuned:
Loops = 10000
Simple write:
8k write 197262.003/second
Compare file sync methods using one write:
open_datasync 8k write 12784.699/second
open_sync 8k write 12684.512/second
8k write, fdatasync 12404.547/second
8k write, fsync 12452.757/second
Compare file sync methods using two writes:
2 open_datasync 8k writes 6376.587/second
2 open_sync 8k writes 6364.113/second
8k write, 8k write, fdatasync 9895.699/second
8k write, 8k write, fsync 9866.886/second
Compare open_sync with different sizes:
open_sync 16k write 10156.491/second
2 open_sync 8k writes 6400.889/second
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
8k write, fsync, close 11142.620/second
8k write, close, fsync 11076.393/second
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
All,
While I have this machine available I've been trying to run some
performance tests using pgbench and various wal_sync_methods. However,
I seem to be maxing out at the speed of pgbench itself; no matter which
wal_sync_method I use (including "fsync"), it tops out at around 2750 TPS.
Of course, it's also possible that the wal_sync_method does not in fact
make a difference in throughput.
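For anyone trying to reproduce these runs, the setting under test lives in postgresql.conf; a sketch of the relevant line (the values listed are the methods compared in this thread):

```
# postgresql.conf -- wal_sync_method options discussed in this thread
wal_sync_method = fdatasync    # alternatives: open_datasync, open_sync, fsync
```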
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Josh Berkus wrote:
I modified test_fsync in two ways to run this; first, to make it support
O_DIRECT, and second to make it run in the *current* directory.
Patch please? I agree with the latter change; what test_fsync does is
surprising.
I suggested a while ago that we refactor test_fsync to share source
code with the database itself for detecting things related to
wal_sync_method, perhaps by extracting that whole set of DEFINE macro
logic to somewhere else. That happened at a bad time in the development
cycle (right before a freeze) and nobody ever got back to the idea
afterwards. If this code is getting touched, and it's clear it is in
some direction, I'd like to see things change so it's not possible for
the two to diverge again afterwards.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
On 12/5/10 2:12 PM, Greg Smith wrote:
Josh Berkus wrote:
I modified test_fsync in two ways to run this; first, to make it support
O_DIRECT, and second to make it run in the *current* directory.
Patch please? I agree with the latter change; what test_fsync does is
surprising.
Attached.
Making it support O_DIRECT would be possible but more complex; I don't
see the point unless we think we're going to have open_sync_with_odirect
as a separate option.
I suggested a while ago that we refactor test_fsync to use a common set
of source code as the database itself for detecting things related to
wal_sync_method, perhaps just extract that whole set of DEFINE macro
logic to somewhere else. That happened at a bad time in the development
cycle (right before a freeze) and nobody ever got back to the idea
afterwards. If this code is getting touched, and it's clear it is in
some direction, I'd like to see things change so it's not possible for
the two to diverge again afterwards.
I don't quite follow you. Maybe nobody else did last time, either.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com