We really ought to do something about O_DIRECT and data=journalled on ext4

Started by Josh Berkus over 15 years ago | 35 messages | hackers
#1Josh Berkus
josh@agliodbs.com

Hackers,

Some of you might already be aware that this combination produces a
fatal startup crash in PostgreSQL:

1. Create an Ext3 or Ext4 partition and mount it with data=journal on a
server with Linux kernel 2.6.30 or later.
2. Initdb a PGDATA on that partition
3. Start PostgreSQL with the default config from that PGDATA

This was reported a ways back:
https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=567113

To explain: opening a file with O_DIRECT on an ext3 or ext4 partition
mounted with data=journal fails, and PostgreSQL crashes at startup.
However, recent Linux kernels now report support for O_DIRECT when we
compile PostgreSQL, so we use it by default. This results in a "crash
by default" situation with new Linuxes if anyone sets data=journal.

We just encountered this again with another user. With RHEL6 out now,
this seems likely to become a fairly common crash report.

Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#1)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

Josh Berkus <josh@agliodbs.com> writes:

Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

regards, tom lane

#3Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#2)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

On 11/30/10 7:09 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

Are we considering backporting that change?

If so, this would be another argument in favor of changing the default.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#2)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

On 11/30/2010 10:09 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

Tom,

we've just had a significant PGX customer encounter this with the latest
Postgres on Redhat's freshly released flagship product. Presumably the
default wal_sync_method will only change prospectively. But this will
feel to every user out there who encounters it like a bug in our code,
and it needs attention. It was darn difficult to diagnose, and many
people will just give up in disgust if they encounter it.

cheers

andrew

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#4)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

Andrew Dunstan <andrew@dunslane.net> writes:

On 11/30/2010 10:09 PM, Tom Lane wrote:

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

we've just had a significant PGX customer encounter this with the latest
Postgres on Redhat's freshly released flagship product. Presumably the
default wal_sync_method will only change prospectively.

I don't think so. The fact that Linux is changing underneath us is a
compelling reason for back-patching a change here. Our older branches
still have to be able to run on modern OS versions. I'm also fairly
unclear on what you think a fix would look like if it's not effectively
a change in the default.

(Hint: this *will* be changing, one way or another, in Red Hat's version
of 8.4, since that's what RH is shipping in RHEL6.)

regards, tom lane

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#3)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

Josh Berkus <josh@agliodbs.com> writes:

On 11/30/10 7:09 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

Are we considering backporting that change?

If so, this would be another argument in favor of changing the default.

Well, no, actually it's the same (only) argument. We'd never consider
back-patching such a change if our hand weren't being forced by kernel
changes :-(

As things stand, though, I think the only thing that's really open for
discussion is how wide to make the scope of the default-change: should
we just do it across the board, or try to limit it to some subset of the
platforms where open_datasync is currently the default. And that's a
decision that ought to be informed by some performance testing.

regards, tom lane

#7Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Tom Lane (#6)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

Tom Lane <tgl@sss.pgh.pa.us> writes:

As things stand, though, I think the only thing that's really open for
discussion is how wide to make the scope of the default-change: should
we just do it across the board, or try to limit it to some subset of the
platforms where open_datasync is currently the default. And that's a
decision that ought to be informed by some performance testing.

Maybe I have a distorted view of the situation, having hit the
problem with an Ubuntu upgrade, but it really does not look like a
performance item to me.

PANIC: could not open file "pg_xlog/000000010000000000000001" (log file 0, segment 1): Invalid argument

It took me quite some time to be able to start my development cluster
again and validate some new patch to send to the list.

Now I understand that you want to test the other alternatives before
choosing among those which work, but my opinion is that it should be fixed
in HEAD before next alpha, or even ASAP. It could be that a HINT here
would be enough for contributors not to lose too much time. It would be:

HINT: if you're running Linux, please try to change wal_sync_method,
open_datasync is not reliable anymore in recent kernels. An example of
trustworthy setting is fdatasync.
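
The proposed HINT could be attached only when the failure looks like the one in this thread. The following is a minimal sketch of that decision, assuming the trigger is an EINVAL from open() while an open_* sync method is in effect. The function name wal_open_hint and the exact wording are hypothetical; this is not PostgreSQL code.

```python
import errno

# Sync methods that open the WAL file with O_SYNC/O_DSYNC (and, in the
# releases discussed here, possibly O_DIRECT as well).
OPEN_SYNC_METHODS = ("open_sync", "open_datasync")


def wal_open_hint(err_no, sync_method):
    """Return a hint string for a failed WAL open, or None if none applies.

    Hypothetical helper: the trigger condition (EINVAL plus an open_*
    sync method) is an assumption based on the PANIC shown in this thread.
    """
    if err_no == errno.EINVAL and sync_method in OPEN_SYNC_METHODS:
        return (
            "HINT: the filesystem may not support the open flags used by "
            "wal_sync_method = %s; try wal_sync_method = fdatasync."
            % sync_method
        )
    return None
```

Other errors, or an EINVAL under fdatasync, produce no hint, so the message would not mislead users whose failure has a different cause.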

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#8Marti Raudsepp
marti@juffo.org
In reply to: Dimitri Fontaine (#7)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

On Wed, Dec 1, 2010 at 12:35, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:

PANIC:  could not open file "pg_xlog/000000010000000000000001" (log file 0, segment 1): Invalid argument

+1 I got the same error when trying to get PostgreSQL working on tmpfs
and gave up.

Now I understand that you want to test the other alternatives before
choosing among those which work, but my opinion is that it should be fixed
in HEAD before next alpha, or even ASAP.

It's queued for this month's commitfest, so things are moving.

https://commitfest.postgresql.org/action/patch_view?id=432

Regards,
Marti

#9Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

On Wed, Dec 1, 2010 at 12:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Josh Berkus <josh@agliodbs.com> writes:

On 11/30/10 7:09 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

Apparently, testing for O_DIRECT at compile time isn't adequate.  Ideas?

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

Are we considering backporting that change?

If so, this would be another argument in favor of changing the default.

Well, no, actually it's the same (only) argument.  We'd never consider
back-patching such a change if our hand weren't being forced by kernel
changes :-(

As things stand, though, I think the only thing that's really open for
discussion is how wide to make the scope of the default-change: should
we just do it across the board, or try to limit it to some subset of the
platforms where open_datasync is currently the default.  And that's a
decision that ought to be informed by some performance testing.

If we could get a clear idea of what performance testing needs to be
done, I suspect we could find some people willing to do it. What do
you think would be useful?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#5)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

On 11/30/2010 11:17 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 11/30/2010 10:09 PM, Tom Lane wrote:

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

we've just had a significant PGX customer encounter this with the latest
Postgres on Redhat's freshly released flagship product. Presumably the
default wal_sync_method will only change prospectively.

I don't think so. The fact that Linux is changing underneath us is a
compelling reason for back-patching a change here. Our older branches
still have to be able to run on modern OS versions. I'm also fairly
unclear on what you think a fix would look like if it's not effectively
a change in the default.

(Hint: this *will* be changing, one way or another, in Red Hat's version
of 8.4, since that's what RH is shipping in RHEL6.)

Well, my initial idea was that if PG_O_DIRECT is non-zero, we should
test at startup time if we can use it on the WAL file system and inhibit
its use if not.
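
A startup-time check of this kind can be sketched as follows. This is an illustration in Python rather than backend C, and the function name o_direct_usable is hypothetical. It assumes that an unsupported O_DIRECT shows up as a failed open(), as the EINVAL in the PANIC message suggests.

```python
import os
import tempfile

# O_DIRECT is not defined on every platform; treat it as 0 when missing.
O_DIRECT = getattr(os, "O_DIRECT", 0)


def o_direct_usable(directory):
    """Return True if a file in `directory` can be opened with O_DIRECT."""
    if O_DIRECT == 0:
        return False
    # Create a scratch file on the same filesystem as the WAL directory.
    fd, path = tempfile.mkstemp(dir=directory, prefix="odirect_probe_")
    os.close(fd)
    try:
        try:
            probe = os.open(path, os.O_RDWR | O_DIRECT)
        except OSError:
            # EINVAL is the error reported in this thread; any failure to
            # open the probe file is treated as "do not use O_DIRECT".
            return False
        os.close(probe)
        return True
    finally:
        os.unlink(path)
```

The probe only describes the filesystem as it is mounted at the moment of the call, so a later remount with different options is not covered.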

Incidentally, I notice it's not used at all in test_fsync.c - should it
not be?

cheers

andrew

#11Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#6)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

Tom,

Well, no, actually it's the same (only) argument. We'd never consider
back-patching such a change if our hand weren't being forced by kernel
changes :-(

I think we have to back-patch the change. The way it is now, a DBA who
thinks they are doing normal sensible configuration can cause PostgreSQL
to fail to restart. Imagine this scenario, for example:

1) DBA, using PostgreSQL 8.3, gets worried about possible disk issues
2) DBA changes their single Ext3/4 partition to "data=journal"
3) DBA restarts system
4) PostgreSQL won't start
5) DBA thrashes around for a few hours while the site is down
6) DBA gets fired and the new DBA migrates to some other DBMS.

I simply can't think of *anywhere* we could put the information about
opensync and Linux/Ext which would be prominent enough to avoid the
above scenario. And per replies, a lot of people have hit this issue
already.

It's a bug and it's our bug. Back when we added O_DIRECT, we assumed
that support for O_DIRECT/opensync could be determined on an OS/kernel
basis, because that was the information we had. Now it turns out that
support can vary *by filesystem* and *between remounts*. We didn't have
any way of knowing different back in 2004, but that doesn't mean we
don't need to fix our mistaken assumption now.

Ideally, we would change our code to test support for O_DIRECT on
startup, rather than at compile time, and backport *that*.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#11)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

Josh Berkus <josh@agliodbs.com> writes:

It's a bug and it's our bug.

No, it's a filesystem bug that this particular filesystem doesn't
support a perfectly reasonable combination of options, and doesn't
even fail gracefully as it could easily do. But assigning blame
doesn't help much.

Back when we added O_DIRECT, we assumed
that support for O_DIRECT/opensync could be determined on an OS/kernel
basis, because that was the information we had. Now it turns out that
support can vary *by filesystem* and *between remounts*. We didn't have
any way of knowing different back in 2004, but that doesn't mean we
don't need to fix our mistaken assumption now.

Ideally, we would change our code to test support for O_DIRECT on
startup, rather than at compile time, and backport *that*.

I'm not convinced that a startup-time test would be enough either,
since as you note a remount might be enough to change the situation.
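
One way to cope with the remount case would be to decide at every open rather than once at startup. The sketch below is only an illustration of that idea, in Python, with a hypothetical function name open_wal_segment. It retries without O_DIRECT when the kernel rejects the flag with EINVAL, which is an assumption based on the error seen in this thread.

```python
import errno
import os

# Flags that may be missing on some platforms default to 0.
O_DIRECT = getattr(os, "O_DIRECT", 0)
O_DSYNC = getattr(os, "O_DSYNC", 0)


def open_wal_segment(path):
    """Open `path` for synchronous writes; return (fd, used_o_direct)."""
    base = os.O_RDWR | os.O_CREAT | O_DSYNC
    if O_DIRECT:
        try:
            return os.open(path, base | O_DIRECT, 0o600), True
        except OSError as exc:
            # Only fall back on EINVAL; other errors are real failures.
            if exc.errno != errno.EINVAL:
                raise
    # Retry (or first attempt, if O_DIRECT is unavailable) without it.
    return os.open(path, base, 0o600), False
```

The caller learns from the second tuple element whether direct I/O is actually in effect, which matters because O_DIRECT writes need aligned buffers and ordinary writes do not.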

I think the best answer is to get out of the business of using
O_DIRECT by default, especially seeing that available evidence
suggests it might not be a performance win anyway.

regards, tom lane

#13Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#12)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

I think the best answer is to get out of the business of using
O_DIRECT by default, especially seeing that available evidence
suggests it might not be a performance win anyway.

Well, we don't have any performance evidence ... there's an issue with
the fsync-test script which causes it not to use O_DIRECT.

However, we haven't seen any evidence for benefits on any production
filesystem, either. So given the lack of evidence of performance
benefit, combined with the definite evidence of related failures, I
agree that simply disabling O_DIRECT by default would be a good way to
solve this.

It might be nice to add new sync_method options, "osync_odirect" and
"odatasync_odirect" for DBAs who think they know enough to tune with
non-defaults.
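
To make the proposal concrete, here is a rough sketch of how sync-method names might map to open() flags if O_DIRECT became opt-in. The two "*_odirect" names are the ones suggested above and do not exist in PostgreSQL; the table and helper names are hypothetical.

```python
import os

# Flags that may be missing on some platforms default to 0.
O_DIRECT = getattr(os, "O_DIRECT", 0)
O_SYNC = getattr(os, "O_SYNC", 0)
O_DSYNC = getattr(os, "O_DSYNC", 0)

# Hypothetical mapping: fsync and fdatasync add no open flags because they
# flush with a separate system call after writing.
SYNC_METHOD_OPEN_FLAGS = {
    "fsync": 0,
    "fdatasync": 0,
    "open_sync": O_SYNC,
    "open_datasync": O_DSYNC,
    "osync_odirect": O_SYNC | O_DIRECT,          # proposed name
    "odatasync_odirect": O_DSYNC | O_DIRECT,     # proposed name
}


def open_flags_for(method):
    """Return the extra open() flags for a sync method name."""
    try:
        return SYNC_METHOD_OPEN_FLAGS[method]
    except KeyError:
        raise ValueError("unknown sync method: %r" % (method,))


def uses_o_direct(method):
    """Return True if the method would request O_DIRECT on this platform."""
    return bool(O_DIRECT) and bool(open_flags_for(method) & O_DIRECT)
```

Under this layout only a DBA who explicitly picks one of the "*_odirect" names would ever get direct I/O.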

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#14Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#12)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

On Wednesday 01 December 2010 19:09:05 Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

It's a bug and it's our bug.

No, it's a filesystem bug that this particular filesystem doesn't
support a perfectly reasonable combination of options, and doesn't
even fail gracefully as it could easily do. But assigning blame
doesn't help much.

I wouldn't call it a reasonable combination - promising fs-level
data-journaling (data=journal) and O_DIRECT are not really compatible
with each other...

Andres

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#13)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

Josh Berkus <josh@agliodbs.com> writes:

It might be nice to add new sync_method options, "osync_odirect" and
"odatasync_odirect" for DBAs who think they know enough to tune with
non-defaults.

That would have the benefit that we'd not have to argue with people
who liked the current behavior (assuming there are any). I'm not
sure there's much technical advantage, but from a political standpoint
it might be the easiest sort of change to push through.

However, this doesn't really address the question of what a sensible
choice of default is. If there's little evidence about whether the
current flavor of open_datasync is really the fastest way, there's
none whatsoever that establishes open_datasync_without_o_direct
being a sane choice of default.

regards, tom lane

#16Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#15)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

However, this doesn't really address the question of what a sensible
choice of default is. If there's little evidence about whether the
current flavor of open_datasync is really the fastest way, there's
none whatsoever that establishes open_datasync_without_o_direct
being a sane choice of default.

No, I'd switch to fdatasync. That's the performance that most people
are familiar with anyway, since it was all that Linux supported before.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#17Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#14)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

On 12/01/2010 01:41 PM, Andres Freund wrote:

On Wednesday 01 December 2010 19:09:05 Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

It's a bug and it's our bug.

No, it's a filesystem bug that this particular filesystem doesn't
support a perfectly reasonable combination of options, and doesn't
even fail gracefully as it could easily do. But assigning blame
doesn't help much.

I wouldn't call it a reasonable combination - promising fs-level
data-journaling (data=journal) and O_DIRECT are not really compatible
with each other...

OK, but how is an application supposed to know that data journaling is
set? Postgres doesn't even look at the FS type, let alone the mount
options. From the app's POV it's perfectly reasonable. If the OS is
going to provide the API, it should expect people to use it.
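
For reference, this is roughly what looking at the FS type and mount options would involve on Linux. The sketch parses text in the /proc/mounts format, passed in as a string so it can be tried without a real mount. The function names are hypothetical, and escaped characters in mount points (such as \040 for a space) are not handled.

```python
import os


def mount_options_for(path, mounts_text):
    """Return (fstype, options) of the mount that contains `path`.

    `mounts_text` uses the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        # Match whole path components only, so "/var" does not match
        # "/variable". The root mount "/" matches everything.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # Prefer the longest mount point; later entries win ties.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fstype, options.split(","))
    if best is None:
        raise LookupError("no mount found for %s" % path)
    return best[1], best[2]


def is_data_journal(path, mounts_text):
    """Return True if `path` is on ext3/ext4 mounted with data=journal."""
    fstype, options = mount_options_for(path, mounts_text)
    return fstype in ("ext3", "ext4") and "data=journal" in options
```

On a live system the text would come from reading /proc/mounts. Even then the answer is only valid until the next remount, which is the weakness of any check that is made once.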

cheers

andrew

#18Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#12)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

Tom Lane wrote:

I think the best answer is to get out of the business of using
O_DIRECT by default, especially seeing that available evidence
suggests it might not be a performance win anyway.

I was concerned that open_datasync might be doing a better job of
forcing data out of drive write caches. But the tests I've done on
RHEL6 so far suggest that's not true; the write guarantees seem to be
the same as when using fdatasync. And there's certainly one performance
regression possible going from fdatasync to open_datasync, the case
where you're overflowing wal_buffers before you actually commit.
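
The difference between the two methods can be illustrated with a small timing sketch in the spirit of test_fsync. It is written in Python for brevity, the function names are hypothetical, and it does not use O_DIRECT. With O_DSYNC every write waits for the data to be flushed, while the fdatasync variant lets writes accumulate in the OS cache and flushes once at the end, which is the pattern that matters when several WAL writes happen before a commit.

```python
import os
import time

# Fall back gracefully on platforms lacking these interfaces.
O_DSYNC = getattr(os, "O_DSYNC", 0)
_fdatasync = getattr(os, "fdatasync", os.fsync)

BLOCK = b"\0" * 8192  # one 8 kB block per write


def time_open_datasync(path, blocks):
    """Write `blocks` blocks to a file opened with O_DSYNC; return seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | O_DSYNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(blocks):
            os.write(fd, BLOCK)  # each write is flushed before returning
        return time.perf_counter() - start
    finally:
        os.close(fd)


def time_fdatasync(path, blocks):
    """Write `blocks` blocks, then flush once with fdatasync; return seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(blocks):
            os.write(fd, BLOCK)  # buffered by the OS
        _fdatasync(fd)  # single flush, as at commit time
        return time.perf_counter() - start
    finally:
        os.close(fd)
```

Results depend heavily on the drive, its write cache, and the filesystem, so numbers from a sketch like this are only a starting point for real database-level tests.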

Below is a test of the troublesome behavior on the same RHEL6 system I
gave test_fsync performance test results from at
http://archives.postgresql.org/message-id/4CE2EBF8.4040602@2ndquadrant.com

This confirms that the kernel now defining O_DSYNC behavior as being
available, but not actually supporting it when running the filesystem in
journaled mode, is the problem here. That's clearly a kernel bug and no
fault of PostgreSQL, it's just never been exposed in a default
configuration before. The RedHat bugzilla report seems a bit unclear
about what's going on here, may be worth updating that to note the
underlying cause.

Regardless, I'm now leaning heavily toward the idea of avoiding
open_datasync by default given this bug, and backpatching that change to
at least 8.4. I'll do some more database-level performance tests here
just as a final sanity check on that. My gut feel is now that we'll
eventually be taking something like Marti's patch, adding some more
documentation around it, and applying that to HEAD as well as some
number of back branches.

$ mount | head -n 1
/dev/sda7 on / type ext4 (rw)
$ cat $PGDATA/postgresql.conf | grep wal_sync_method
#wal_sync_method = fdatasync # the default is the first option
$ pg_ctl start
server starting
LOG: database system was shut down at 2010-12-01 17:20:16 EST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
$ psql -c "show wal_sync_method"
wal_sync_method
-----------------
open_datasync

[Edit /etc/fstab, change mount options to be "data=journal" and reboot]

$ mount | grep journal
/dev/sda7 on / type ext4 (rw,data=journal)
$ cat postgresql.conf | grep wal_sync_method
#wal_sync_method = fdatasync # the default is the first option
$ pg_ctl start
server starting
LOG: database system was shut down at 2010-12-01 12:14:50 EST
PANIC: could not open file "pg_xlog/000000010000000000000001" (log file 0, segment 1): Invalid argument
LOG: startup process (PID 2690) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
$ pg_ctl stop

$ vi $PGDATA/postgresql.conf
$ cat $PGDATA/postgresql.conf | grep wal_sync_method
wal_sync_method = fdatasync # the default is the first option
$ pg_ctl start
server starting
LOG: database system was shut down at 2010-12-01 12:14:40 EST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

#19Bruce Momjian
bruce@momjian.us
In reply to: Andrew Dunstan (#10)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

Andrew Dunstan wrote:

On 11/30/2010 11:17 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 11/30/2010 10:09 PM, Tom Lane wrote:

We should wait for the outcome of the discussion about whether to change
the default wal_sync_method before worrying about this.

we've just had a significant PGX customer encounter this with the latest
Postgres on Redhat's freshly released flagship product. Presumably the
default wal_sync_method will only change prospectively.

I don't think so. The fact that Linux is changing underneath us is a
compelling reason for back-patching a change here. Our older branches
still have to be able to run on modern OS versions. I'm also fairly
unclear on what you think a fix would look like if it's not effectively
a change in the default.

(Hint: this *will* be changing, one way or another, in Red Hat's version
of 8.4, since that's what RH is shipping in RHEL6.)

Well, my initial idea was that if PG_O_DIRECT is non-zero, we should
test at startup time if we can use it on the WAL file system and inhibit
its use if not.

Incidentally, I notice it's not used at all in test_fsync.c - should it
not be?

test_fsync certainly should be using PG_O_DIRECT in the same places the
backend does. Once we decide how to handle PG_O_DIRECT, I will modify
test_fsync to match.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#20Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#19)
Re: We really ought to do something about O_DIRECT and data=journalled on ext4

All,

So, I've been doing some reading about this issue, and I think
regardless of what other changes we make we should never enable O_DIRECT
automatically on Linux, and it was a mistake for us to do so in the
first place.

First, in the Linux docs for open():

=========

In summary, O_DIRECT is a potentially powerful tool that should be used
with caution. It is recommended that applications treat use of O_DIRECT
as a performance option which is disabled by default.

=========

Second, Linus has a quote about O_DIRECT that I think should serve as an
indicator to us that directIO will never be beneficial-by-default on
Linux, and might even someday be desupported:

============

The right way to do it is to just not use O_DIRECT.

The whole notion of "direct IO" is totally braindamaged. Just say no.

This is your brain: O
This is your brain on O_DIRECT: .

Any questions?

I should have fought back harder. There really is no valid reason for EVER
using O_DIRECT. You need a buffer whatever IO you do, and it might as well
be the page cache. There are better ways to control the page cache than
play games and think that a page cache isn't necessary.

So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
instead.

Linus
=============
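
For readers unfamiliar with the alternative named in the quote, here is a minimal sketch of posix_fadvise() in use, written in Python. The function name drop_from_page_cache is hypothetical. The advice is only a hint to the kernel, and the call is skipped on platforms where Python does not expose it.

```python
import os


def drop_from_page_cache(fd):
    """Flush `fd`, then hint that its cached pages are no longer needed.

    Returns True if the hint was issued, False if posix_fadvise is not
    available on this platform.
    """
    # Dirty pages cannot be dropped, so write them out first.
    os.fsync(fd)
    if hasattr(os, "posix_fadvise"):
        # offset 0 and length 0 mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return True
    return False
```

This keeps writes going through the page cache, as the quote recommends, while still letting the application tell the kernel which data it does not expect to read again.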

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
