O_DIRECT use

Started by Bruce Momjian about 24 years ago (11 messages)

#1 Bruce Momjian
pgman@candle.pha.pa.us

I have added this item to TODO:

* Consider use of open/fcntl(O_DIRECT) to minimize OS caching

A web search suggests it minimizes file system caching, which may help
for sequential scans:

http://archives2.us.postgresql.org/pgsql-hackers/2001-09/msg00713.php

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#1)
Re: O_DIRECT use

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> I have added this item to TODO:
> * Consider use of open/fcntl(O_DIRECT) to minimize OS caching

Why exactly would we wish to minimize OS caching?

In my mind, Postgres has always relied heavily on the existence of a
layer of kernel caching. Disabling that will hurt far more than help.

regards, tom lane

#3 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#2)
Re: O_DIRECT use

Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>
>> I have added this item to TODO:
>> * Consider use of open/fcntl(O_DIRECT) to minimize OS caching
>
> Why exactly would we wish to minimize OS caching?
>
> In my mind, Postgres has always relied heavily on the existence of a
> layer of kernel caching. Disabling that will hurt far more than help.

Not sure. Someone on IRC brought it up. If we are sequential scanning a
large table, caching may be bad because we are pushing out stuff already
in the cache that may be useful. It is related to this TODO item:

* Add free-behind capability for large sequential scans (Bruce)

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#3)
Re: O_DIRECT use

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Tom Lane wrote:
>
>> Why exactly would we wish to minimize OS caching?
>
> Not sure. Someone on IRC brought it up. If we are sequential scanning a
> large table, caching may be bad because we are pushing out stuff already
> in the cache that may be useful.

Yeah, but people normally try to set things up to avoid doing large
sequential scans, at least in all the contexts where they need high
performance. For index searches you definitely want all the caching
you can get.

For that matter, I would expect that O_DIRECT also defeats readahead,
so I'd fully expect it to be a loser for seqscans too.

regards, tom lane

#5 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#4)
Re: O_DIRECT use

Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>
>> Tom Lane wrote:
>>
>>> Why exactly would we wish to minimize OS caching?
>>
>> Not sure. Someone on IRC brought it up. If we are sequential scanning a
>> large table, caching may be bad because we are pushing out stuff already
>> in the cache that may be useful.
>
> Yeah, but people normally try to set things up to avoid doing large
> sequential scans, at least in all the contexts where they need high
> performance. For index searches you definitely want all the caching
> you can get.
>
> For that matter, I would expect that O_DIRECT also defeats readahead,
> so I'd fully expect it to be a loser for seqscans too.

I am told on FreeBSD it does not disable read-ahead, just caching;
something that needs more research.

#6 Brent Verner
brent@rcfile.org
In reply to: Bruce Momjian (#3)
Re: O_DIRECT use

[2002-01-04 16:31] Bruce Momjian said:

| Not sure. Someone on IRC brought it up.

Is there a pg IRC channel? What is the server?

cheers.
brent

--
"Develop your talent, man, and leave the world something. Records are
really gifts from people. To think that an artist would love you enough
to share his music with anyone is a beautiful thing." -- Duane Allman

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#5)
Re: O_DIRECT use

Bruce Momjian <pgman@candle.pha.pa.us> writes:

>> For that matter, I would expect that O_DIRECT also defeats readahead,
>> so I'd fully expect it to be a loser for seqscans too.
>
> I am told on FreeBSD it does not disable read-ahead, just caching;
> something that needs more research.

Hmm. I always thought of read-ahead as preloading buffer cache entries.

It'd be interesting to get a description of *exactly* what this flag
does, rather than handwavy approximations. Time to start reading the
kernel code, I suppose.

regards, tom lane

#8 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#7)
Re: O_DIRECT use

Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>
>>> For that matter, I would expect that O_DIRECT also defeats readahead,
>>> so I'd fully expect it to be a loser for seqscans too.
>>
>> I am told on FreeBSD it does not disable read-ahead, just caching;
>> something that needs more research.
>
> Hmm. I always thought of read-ahead as preloading buffer cache entries.
>
> It'd be interesting to get a description of *exactly* what this flag
> does, rather than handwavy approximations. Time to start reading the
> kernel code, I suppose.

I found this before adding the item:

http://www.pairlist.net/pipermail/flow-tools/2001-October/000058.html

And this for FreeBSD 4.4:

2.1 Kernel Changes

The O_DIRECT flag has been added to open(2) and fcntl(2). Specifying this
flag for open files will attempt to minimize the cache effects of reading
and writing.

I also found:

http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html

These latter ones seem to indicate there isn't read-ahead, meaning we
would have to do our own prefetches. Eck. I am unclear if that is true
on all OS's.

#9 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Brent Verner (#6)
Re: O_DIRECT use

Brent Verner wrote:

> [2002-01-04 16:31] Bruce Momjian said:
>
> | Not sure. Someone on IRC brought it up.
>
> Is there a pg IRC channel? What is the server?

See FAQ item 1.6.

#10 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Brent Verner (#6)
Re: O_DIRECT use

Brent Verner wrote:

> [2002-01-04 16:31] Bruce Momjian said:
>
> | Not sure. Someone on IRC brought it up.
>
> Is there a pg IRC channel? What is the server?

FAQ item text is:

<P>There is also an IRC channel on EFNet, channel
<I>#PostgreSQL.</I> I use the unix command <CODE>irc -c
'#PostgreSQL' "$USER" irc.phoenix.net.</CODE></P>

#11 Matthew Kirkwood
matthew@hairy.beasts.org
In reply to: Bruce Momjian (#8)
Re: O_DIRECT use

On Fri, 4 Jan 2002, Bruce Momjian wrote:

>> For that matter, I would expect that O_DIRECT also defeats readahead,
>> so I'd fully expect it to be a loser for seqscans too.
>
> And this for FreeBSD 4.4:
>
> The O_DIRECT flag has been added to open(2) and fcntl(2). Specifying this
> flag for open files will attempt to minimize the cache effects of reading
> and writing.

This seems rather vague. Can any FreeBSD person here say
whether the semantics are any stronger?

> http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-o_direct.html
>
> These latter ones seem to indicate there isn't read-ahead, meaning we
> would have to do our own prefetches. Eck. I am unclear if that is
> true on all OS's.

The Linux O_DIRECT semantics are intended to be stricter.
In essence, the kernel _will not cache_ data read from
or written to such a file or device.

The point of this, incidentally, was to be able to run
things like Oracle Parallel Server and other shared-
disk setups. Its use as an "I don't need this cached"
mechanism is secondary, and rather sub-optimal, as seen
here; you disable software read-ahead and introduce
coherence issues with non-O_DIRECT openers of the file.
(I'm not sure of the precise Linux semantics of this,
but it's probably fair to say that you may as well
consider them undefined.)
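
For illustration only, here is a minimal C sketch of what a sequential
read through O_DIRECT might look like. It assumes a Linux-style flag
(glibc exposes it under _GNU_SOURCE) and assumes that 4096-byte buffer
alignment with 64 kB reads satisfies the device's requirements. The
helper names open_maybe_direct and scan_file are hypothetical and come
from nowhere in PostgreSQL. Because there is no kernel read-ahead on
this path, a real implementation would also need its own prefetching,
which this sketch does not attempt.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc only exposes O_DIRECT with this */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define SKETCH_ALIGN 4096           /* assumed buffer alignment */
#define SKETCH_CHUNK (64 * 1024)    /* assumed read size, a multiple of the alignment */

/* Hypothetical helper: try O_DIRECT first, fall back to a buffered open. */
int open_maybe_direct(const char *path, int *used_direct)
{
#ifdef O_DIRECT
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) {
        *used_direct = 1;
        return fd;
    }
#endif
    /* Flag unavailable, or the file system rejected it. */
    *used_direct = 0;
    return open(path, O_RDONLY);
}

/* Hypothetical helper: read the whole file and return its byte count, or -1. */
long long scan_file(const char *path, int *used_direct)
{
    void *buf = NULL;
    long long total = 0;
    int fd;

    /* O_DIRECT reads generally need an aligned user buffer. */
    if (posix_memalign(&buf, SKETCH_ALIGN, SKETCH_CHUNK) != 0)
        return -1;

    fd = open_maybe_direct(path, used_direct);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    for (;;) {
        ssize_t n = read(fd, buf, SKETCH_CHUNK);

        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && errno == EINVAL && *used_direct && total == 0) {
            /* Open succeeded but direct reads are refused: retry buffered. */
            close(fd);
            *used_direct = 0;
            fd = open(path, O_RDONLY);
            if (fd < 0) {
                total = -1;
                break;
            }
            continue;
        }
        if (n < 0) {
            total = -1;
            break;
        }
        total += n;
        /*
         * On a regular file a short read means end of file. Stopping here
         * also avoids issuing a direct read from an unaligned offset.
         */
        if (n < (ssize_t) SKETCH_CHUNK)
            break;
    }

    if (fd >= 0)
        close(fd);
    free(buf);
    return total;
}
```

Calling scan_file(path, &used) returns the number of bytes read and
sets used to 1 only when the direct path was actually taken, so the
same call can be timed against an ordinary buffered scan.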

Linux 2.4 has "madvise", but unfortunately no matching
"fadvise". A quick Google implied that FreeBSD is in
the same boat.
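
As a hedged sketch of the madvise route: the file has to be mapped
first, and the hint is advisory, so whether a given kernel acts on it
is an assumption rather than a guarantee. The function name
sum_file_mmap_sequential is hypothetical, and summing the bytes simply
stands in for a sequential scan.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* for madvise() under strict -std modes */
#endif
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map a file, hint sequential access, sum its bytes. */
long long sum_file_mmap_sequential(const char *path)
{
    struct stat st;
    unsigned char *p;
    long long sum = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {      /* mmap of length 0 is an error */
        close(fd);
        return 0;
    }

    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    /* Advisory only: tell the kernel the pages will be read in order. */
    (void) madvise(p, (size_t) st.st_size, MADV_SEQUENTIAL);

    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];

    munmap(p, (size_t) st.st_size);
    return sum;
}
```

Unlike O_DIRECT, this keeps the data in the kernel's cache and leaves
read-ahead in place; it only changes the hint the kernel receives.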

Matthew.