O_DIRECT in freebsd

Started by Christopher Kings-Lynne, about 22 years ago, 13 messages

chriskl@familyhealth.com.au

about 22 years ago

FreeBSD 4.9 was released today. In the release notes was:

2.2.6 File Systems

A new DIRECTIO kernel option enables support for read operations that
bypass the buffer cache and put data directly into a userland buffer.
This feature requires that the O_DIRECT flag is set on the file
descriptor and that both the offset and length for the read operation
are multiples of the physical media sector size.
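
[Editor's sketch] The alignment rule quoted above comes down to simple arithmetic. The snippet below is a minimal illustration, not FreeBSD or PostgreSQL code: the 512-byte sector size and the helper names `is_direct_io_safe` and `aligned_range` are assumptions made for the example.

```python
SECTOR_SIZE = 512  # assumed sector size; the real value depends on the device


def is_direct_io_safe(offset, length, sector=SECTOR_SIZE):
    """True if both offset and length are multiples of the sector size."""
    return offset % sector == 0 and length % sector == 0


def aligned_range(offset, length, sector=SECTOR_SIZE):
    """Widen (offset, length) to the enclosing sector-aligned range.

    Returns (aligned_offset, aligned_length, skip), where skip is the number
    of leading bytes to discard to get back to the requested offset.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    start = (offset // sector) * sector              # round the start down
    end = -(-(offset + length) // sector) * sector   # round the end up
    return start, end - start, offset - start
```

A read of one 8192-byte page at a page boundary already satisfies the rule for 512-byte sectors, while an arbitrary byte range has to be widened first.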

Is that of any use?

Chris

Doug McNaught

doug@mcnaught.org

about 22 years ago

In reply to: Christopher Kings-Lynne (#1)

Re: O_DIRECT in freebsd

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

> FreeBSD 4.9 was released today. In the release notes was:
>
> 2.2.6 File Systems
>
> A new DIRECTIO kernel option enables support for read operations that
> bypass the buffer cache and put data directly into a userland
> buffer. This feature requires that the O_DIRECT flag is set on the
> file descriptor and that both the offset and length for the read
> operation are multiples of the physical media sector size.
>
> Is that of any use?

Linux and Solaris have had this for a while. I'm pretty sure it's
been discussed before--search the archives. I think the consensus
was that it might be useful for WAL writes, but would be a fair amount
of work and would introduce portability issues...

-Doug

scott.marlowe

scott.marlowe@ihs.com

about 22 years ago

In reply to: Doug McNaught (#2)

Re: O_DIRECT in freebsd

On 29 Oct 2003, Doug McNaught wrote:

> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>
>> FreeBSD 4.9 was released today. In the release notes was:
>>
>> 2.2.6 File Systems
>>
>> A new DIRECTIO kernel option enables support for read operations that
>> bypass the buffer cache and put data directly into a userland
>> buffer. This feature requires that the O_DIRECT flag is set on the
>> file descriptor and that both the offset and length for the read
>> operation are multiples of the physical media sector size.
>>
>> Is that of any use?
>
> Linux and Solaris have had this for a while. I'm pretty sure it's
> been discussed before--search the archives. I think the consensus
> was that it might be useful for WAL writes, but would be a fair amount
> of work and would introduce portability issues...

I would think the biggest savings could come from using directIO for
vacuuming, so it doesn't cause the kernel to flush buffers.

Would that be just as hard to implement?

Doug McNaught

doug@mcnaught.org

about 22 years ago

In reply to: scott.marlowe (#3)

Re: O_DIRECT in freebsd

"scott.marlowe" <scott.marlowe@ihs.com> writes:

> I would think the biggest savings could come from using directIO for
> vacuuming, so it doesn't cause the kernel to flush buffers.
>
> Would that be just as hard to implement?

Two words: "cache coherency".

-Doug

Tom Lane

tgl@sss.pgh.pa.us

about 22 years ago

In reply to: Doug McNaught (#2)

Re: O_DIRECT in freebsd

Doug McNaught <doug@mcnaught.org> writes:

> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>
>> A new DIRECTIO kernel option enables support for read operations that
>> bypass the buffer cache and put data directly into a userland
>> buffer. This feature requires that the O_DIRECT flag is set on the
>> file descriptor and that both the offset and length for the read
>> operation are multiples of the physical media sector size.
>
> Linux and Solaris have had this for a while. I'm pretty sure it's
> been discussed before--search the archives. I think the consensus
> was that it might be useful for WAL writes, but would be a fair amount
> of work and would introduce portability issues...

Not for WAL --- we never read the WAL at all in normal operation. (If
it works for writes, then we would want to use it for writing WAL, but
that's not apparent from what Christopher quoted.)

IIRC there was speculation that this would be useful for large seqscans
and for vacuuming. It'd take some hacking to propagate the knowledge of
that context down to where the fopen occurs, though.
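
[Editor's sketch] One way to picture "propagating the context down to the open" is a flag that travels with the request and decides whether O_DIRECT is asked for. This is a rough illustration under several assumptions: it is Python rather than PostgreSQL's C, the names `open_with_hint`, `read_blocks` and `scan` are hypothetical, the 8192-byte block size is assumed, and it silently falls back to buffered I/O where the platform or filesystem refuses O_DIRECT.

```python
import mmap
import os

BLOCK = 8192  # assumed block size; a multiple of common 512/4096-byte sectors


def open_with_hint(path, bulk_scan):
    """Open path read-only, requesting O_DIRECT only for bulk scans.

    Returns (fd, direct), where direct says whether O_DIRECT is in effect.
    """
    direct_flag = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if bulk_scan and direct_flag:
        try:
            return os.open(path, os.O_RDONLY | direct_flag), True
        except OSError:
            pass  # e.g. the filesystem does not support O_DIRECT
    return os.open(path, os.O_RDONLY), False


def read_blocks(fd):
    """Read sequentially from fd in BLOCK-sized chunks until end of file."""
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping gives page-aligned memory
    chunks = []
    try:
        while True:
            n = os.readv(fd, [buf])
            chunks.append(buf[:n])
            if n < BLOCK:  # a short read is treated as end of file
                break
    finally:
        buf.close()
    return b"".join(chunks)


def scan(path, bulk_scan=False):
    """Return the file's contents, bypassing the OS cache if possible."""
    fd, direct = open_with_hint(path, bulk_scan)
    try:
        try:
            return read_blocks(fd)
        except OSError:
            if not direct:
                raise
    finally:
        os.close(fd)
    # The direct read was refused (stricter alignment, perhaps): retry buffered.
    return scan(path, bulk_scan=False)
```

The caller that knows it is doing a large sequential scan or a vacuum would pass `bulk_scan=True`; every other caller keeps the ordinary buffered path.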

regards, tom lane

Manfred Spraul

manfred@colorfullife.com

about 22 years ago

In reply to: Tom Lane (#5)

Re: O_DIRECT in freebsd

Tom Lane wrote:

> Not for WAL --- we never read the WAL at all in normal operation. (If
> it works for writes, then we would want to use it for writing WAL, but
> that's not apparent from what Christopher quoted.)

At least under Linux, it works for writes. Oracle uses O_DIRECT to
access (both read and write) disks that are shared between multiple
nodes in a cluster - their database kernel must know when the data is
visible to the other nodes.
One problem for WAL is that O_DIRECT would disable the write cache -
each operation would block until the data arrived on disk, and that
might block other backends that try to access WALWriteLock.
Perhaps a dedicated backend that does the writeback could fix that.
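
[Editor's sketch] The "dedicated backend that does the writeback" idea can be pictured as a single writer that owns the log file while everyone else just queues records. The following is a toy model, not PostgreSQL code: the class name `LogWriter` is hypothetical, threads stand in for backends, and a buffered write followed by fsync stands in for the synchronous O_DIRECT write.

```python
import os
import queue
import threading


def write_all(fd, data):
    """Write every byte of data to fd, retrying after partial writes."""
    view = memoryview(data)
    while len(view):
        view = view[os.write(fd, view):]


class LogWriter:
    """A single background thread owns the log file descriptor.

    Callers queue records and get an Event back; only the writer thread
    blocks on the slow synchronous write.
    """

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def append(self, record):
        """Queue a record; the returned Event is set once it has been synced."""
        done = threading.Event()
        self._queue.put((record, done))
        return done

    def close(self):
        self._queue.put(None)  # sentinel: stop after the queued work is done
        self._thread.join()
        os.close(self._fd)

    def _run(self):
        stop = False
        while not stop:
            item = self._queue.get()
            if item is None:
                return
            batch = [item]
            # Take whatever else is already queued so one fsync covers it all.
            while True:
                try:
                    nxt = self._queue.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            write_all(self._fd, b"".join(rec for rec, _ in batch))
            os.fsync(self._fd)
            for _, done in batch:
                done.set()
```

A caller that needs durability waits on the returned event; a caller that does not can carry on immediately, which is the point of moving the blocking write out of the lock holder.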

Has anyone tried to use posix_fadvise for the wal logs?
http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html

Linux supports posix_fadvise, it seems to be part of xopen2k.

--
Manfred

Greg Stark

gsstark@mit.edu

about 22 years ago

In reply to: Manfred Spraul (#6)

Re: O_DIRECT in freebsd

Manfred Spraul <manfred@colorfullife.com> writes:

> One problem for WAL is that O_DIRECT would disable the write cache -
> each operation would block until the data arrived on disk, and that might block
> other backends that try to access WALWriteLock.
> Perhaps a dedicated backend that does the writeback could fix that.

aio seems a better fit.

> Has anyone tried to use posix_fadvise for the wal logs?
> http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
>
> Linux supports posix_fadvise, it seems to be part of xopen2k.

Odd, I don't see it anywhere in the kernel. I don't know what syscall it's
using to do this tweaking.

This is the only option that seems useful for postgres for both the WAL and
vacuum (though in other threads it seems the problems with vacuum lie
elsewhere):

POSIX_FADV_DONTNEED attempts to free cached pages associated with the
specified region. This is useful, for example, while streaming large
files. A program may periodically request the kernel to free cached
data that has already been used, so that more useful cached pages are
not discarded instead.

Pages that have not yet been written out will be unaffected, so if the
application wishes to guarantee that pages will be released, it should
call fsync or fdatasync first.
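
[Editor's sketch] The sequence described in the quoted text (flush first, then ask the kernel to drop the cached pages) looks roughly like this. It is an illustration in Python, not PostgreSQL code; the function name `write_and_drop_cache` is hypothetical, and the advice is only a hint that the kernel may ignore.

```python
import os


def write_and_drop_cache(path, data):
    """Write data, force it to disk, then hint that its cached pages can go.

    Returns True if the advice was issued, False where posix_fadvise is
    not available on this platform.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while len(view):
            view = view[os.write(fd, view):]
        os.fsync(fd)  # unwritten pages would be left alone, so flush first
        if hasattr(os, "posix_fadvise"):
            # Offset 0 with length 0 covers the file from start to end.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            return True
        return False
    finally:
        os.close(fd)
```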

Perhaps POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL could be useful in a
backend before starting a sequential scan or index scan, but I kind of doubt
it.

--
greg

Manfred Spraul

manfred@colorfullife.com

about 22 years ago

In reply to: Greg Stark (#7)

Re: O_DIRECT in freebsd

Greg Stark wrote:

> Manfred Spraul <manfred@colorfullife.com> writes:
>
>> One problem for WAL is that O_DIRECT would disable the write cache -
>> each operation would block until the data arrived on disk, and that might block
>> other backends that try to access WALWriteLock.
>> Perhaps a dedicated backend that does the writeback could fix that.
>
> aio seems a better fit.
>
>> Has anyone tried to use posix_fadvise for the wal logs?
>> http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
>>
>> Linux supports posix_fadvise, it seems to be part of xopen2k.
>
> Odd, I don't see it anywhere in the kernel. I don't know what syscall it's
> using to do this tweaking.

At least in 2.6: linux/mm/fadvise.c; the syscall is fadvise64 or fadvise64_64.

> This is the only option that seems useful for postgres for both the WAL and
> vacuum (though in other threads it seems the problems with vacuum lie
> elsewhere):
>
> POSIX_FADV_DONTNEED attempts to free cached pages associated with the
> specified region. This is useful, for example, while streaming large
> files. A program may periodically request the kernel to free cached
> data that has already been used, so that more useful cached pages are
> not discarded instead.
>
> Pages that have not yet been written out will be unaffected, so if the
> application wishes to guarantee that pages will be released, it should
> call fsync or fdatasync first.

I agree. Either immediately after each flush syscall, or just before
closing a log file and switching to the next.

> Perhaps POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL could be useful in a
> backend before starting a sequential scan or index scan, but I kind of doubt
> it.

IIRC the recommendation is ~20% total memory for the postgres user space
buffers. That's quite a lot - it might be sufficient to protect that
cache from vacuum or sequential scans. AddBufferToFreeList already
contains a comment that this is the right place to try buffer
replacement strategies.
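
[Editor's sketch] To make "protecting the cache from vacuum or sequential scans" concrete, here is a toy LRU pool in which pages brought in by a bulk scan are parked at the cold end, so they evict each other rather than the hot pages. It is a swapped-in illustration and does not reproduce AddBufferToFreeList or any actual PostgreSQL strategy; the class `BufferPool` and the `bulk_scan` flag are hypothetical.

```python
from collections import OrderedDict


class BufferPool:
    """Tiny LRU pool; pages loaded by bulk scans are the first to be evicted."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._pages = OrderedDict()  # first item = next eviction victim

    def access(self, page_id, bulk_scan=False):
        """Touch a page; returns True on a cache hit."""
        if page_id in self._pages:
            if not bulk_scan:
                self._pages.move_to_end(page_id)  # normal hit: most recent
            return True  # a scan hit leaves the page where it was
        if len(self._pages) >= self.capacity:
            self._pages.popitem(last=False)  # evict the coldest page
        self._pages[page_id] = True
        if bulk_scan:
            # Park scan pages at the cold end so the next scan page replaces
            # them instead of pushing out frequently used pages.
            self._pages.move_to_end(page_id, last=False)
        return False

    def cached(self):
        """Return the set of page ids currently held."""
        return set(self._pages)
```

With a capacity of four and three hot pages, a 100-page scan flagged as `bulk_scan` leaves the hot pages in place, whereas the same scan through the plain LRU path replaces all of them.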

--
Manfred

Sailesh Krishnamurthy

sailesh@cs.berkeley.edu

about 22 years ago

In reply to: Doug McNaught (#2)

Re: O_DIRECT in freebsd

DB2 supports cooked and raw file systems - SMS (System Managed Space)
and DMS (Database Managed Space) tablespaces.

The DB2 experience is that DMS tends to outperform SMS but requires
considerable tuning and administrative overhead to see these wins.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

#10

Jordan Henderson

jordan_henders@yahoo.com

about 22 years ago

In reply to: Sailesh Krishnamurthy (#9)

Re: O_DIRECT in freebsd

My experience with DB2 showed that properly set up DMS tablespaces provided a
significant performance benefit. I have also seen that the average DBA does
not generally understand the data or access patterns in the database. Given
that, they don't correctly set up tablespaces in general, filesystem or raw.
Likewise, where it is possible to tie a tablespace to a memory buffer pool,
the average DBA does not set it up to a performance advantage either.
However, are we talking about well-tuned setups by someone who does
understand the data and the general access patterns? For a DBA like that,
they should be able to take advantage of these features and get significantly
better results. I would not say it requires considerable tuning, but an
understanding of data, storage and access patterns. Additionally, these
features did not cause our group considerable administrative overhead.

Jordan Henderson

On Thursday 30 October 2003 12:55, Sailesh Krishnamurthy wrote:

> DB2 supports cooked and raw file systems - SMS (System Managed Space)
> and DMS (Database Managed Space) tablespaces.
>
> The DB2 experience is that DMS tends to outperform SMS but requires
> considerable tuning and administrative overhead to see these wins.

#11

Dann Corbit

DCorbit@connx.com

about 22 years ago

In reply to: Jordan Henderson (#10)

Re: O_DIRECT in freebsd

> -----Original Message-----
> From: Jordan Henderson [mailto:jordan_henders@yahoo.com]
> Sent: Thursday, October 30, 2003 4:31 PM
> To: sailesh@cs.berkeley.edu; Doug McNaught
> Cc: Christopher Kings-Lynne; PostgreSQL-development
> Subject: Re: [HACKERS] O_DIRECT in freebsd
>
> My experience with DB2 showed that properly set up DMS tablespaces
> provided a significant performance benefit. I have also seen that the
> average DBA does not generally understand the data or access patterns
> in the database. Given that, they don't correctly set up tablespaces
> in general, filesystem or raw. Likewise, where it is possible to tie a
> tablespace to a memory buffer pool, the average DBA does not set it up
> to a performance advantage either. However, are we talking about
> well-tuned setups by someone who does understand the data and the
> general access patterns? For a DBA like that, they should be able to
> take advantage of these features and get significantly better results.
> I would not say it requires considerable tuning, but an understanding
> of data, storage and access patterns. Additionally, these features did
> not cause our group considerable administrative overhead.

If it is possible for a human with knowledge of this domain to make good
decisions, it ought to be possible to store the same information into an
algorithm that operates off of collected statistics. After some time
has elapsed, and an average access pattern of some sort has been
reached, the available resources could be divided in a fairly efficient
way. It might be nice to be able to tweak it, but I would rather have
the computer make the calculations for me.
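
[Editor's sketch] As a very rough picture of "an algorithm that operates off of collected statistics", the snippet below splits a fixed number of buffer pages among pools in proportion to how often each was accessed. Everything here is an assumption made for illustration: the function name `divide_buffers`, the per-pool access counters, and the proportional rule itself, which is only one of many possible policies.

```python
def divide_buffers(total_pages, access_counts, floor=1):
    """Split total_pages among pools in proportion to observed accesses.

    access_counts maps a pool name to how often it was accessed. Every pool
    gets at least `floor` pages; pages left over from rounding go to the
    busiest pools first.
    """
    names = sorted(access_counts)
    if not names:
        return {}
    spare = total_pages - floor * len(names)
    if spare < 0:
        raise ValueError("not enough pages to give every pool its floor")
    total = sum(access_counts.values())
    # With no statistics yet, treat every pool as equally busy.
    weights = access_counts if total else {name: 1 for name in names}
    total = total or len(names)
    shares = {name: floor + spare * weights[name] // total for name in names}
    leftover = total_pages - sum(shares.values())
    busiest_first = sorted(names, key=lambda name: -weights[name])
    for name in busiest_first[:leftover]:
        shares[name] += 1
    return shares
```

For example, with 100 pages and access counts of 900 and 100, the busier pool ends up with 90 pages and the quieter one with 10.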

Just a thought.

#12

Sailesh Krishnamurthy

sailesh@cs.berkeley.edu

about 22 years ago

In reply to: Jordan Henderson (#10)

Re: O_DIRECT in freebsd

"Jordan" == Jordan Henderson <jordan_henders@yahoo.com> writes:

Jordan> significantly better results. I would not say it requires
Jordan> considerable tuning, but an understanding of data, storage
Jordan> and access patterns. Additionally, these features did not
Jordan> cause our group considerable administrative overhead.

I won't dispute the specifics. I have only worked on the DB2 engine -
never written an app for it nor administered it. You're right - the
bottom line is that you can get a significant performance advantage
provided you care enough to understand what's going on.

Anyway, I merely responded to provide a data point. Will PostgreSQL
users/administrators care for additional knobs or is there a
preference for "keep it simple, stupid" ?

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

#13

Jordan Henderson

jordan_henders@yahoo.com

about 22 years ago

In reply to: Sailesh Krishnamurthy (#12)

Re: O_DIRECT in freebsd

Personally, I think it is useful to have features. I quite understand the
difficulties in maintaining some features, however. Also, having worked on
internals for commercial DB engines, I have seen specifically how code/data
paths can be shortened. I would not make the choice for someone to be forced
into using a product in a specific manner. Instead, I would let them decide
whether to choose a simple setup or, if they are up to it, something with
better performance. I would not prune the options out. In doing so, we
limit a priori what a knowledgeable person can do.

Jordan Henderson

On Thursday 30 October 2003 19:59, Sailesh Krishnamurthy wrote:

> "Jordan" == Jordan Henderson <jordan_henders@yahoo.com> writes:
>
> Jordan> significantly better results. I would not say it requires
> Jordan> considerable tuning, but an understanding of data, storage
> Jordan> and access patterns. Additionally, these features did not
> Jordan> cause our group considerable administrative overhead.
>
> I won't dispute the specifics. I have only worked on the DB2 engine -
> never written an app for it nor administered it. You're right - the
> bottom line is that you can get a significant performance advantage
> provided you care enough to understand what's going on.
>
> Anyway, I merely responded to provide a data point. Will PostgreSQL
> users/administrators care for additional knobs or is there a
> preference for "keep it simple, stupid" ?