O_DIRECT in FreeBSD
I noticed this in the FreeBSD 5.1 release notes:
"A new DIRECTIO kernel option enables support for read operations that
bypass the buffer cache and put data directly into a userland buffer. This
feature requires that the O_DIRECT flag is set on the file descriptor and
that both the offset and length for the read operation are multiples of the
physical media sector size. [MERGED]"
MERGED means that it should also appear in FreeBSD 4.9.
Will PostgreSQL pick this up automatically, or do we need to add extra
checks?
Chris
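For concreteness, here is a minimal sketch (not PostgreSQL code; the `dio_eligible` helper and the 512-byte sector size are assumptions for illustration) of what the alignment rule in that release note means in practice:

```python
# Sketch only: the DIRECTIO rule says a read bypasses the buffer cache
# only if the descriptor has O_DIRECT set AND both the offset and the
# length are multiples of the physical media sector size.
import os

SECTOR = 512  # assumed physical media sector size

def dio_eligible(offset: int, length: int, sector: int = SECTOR) -> bool:
    """True if a read at (offset, length) satisfies the DIRECTIO rule."""
    return offset % sector == 0 and length % sector == 0

# A whole-sector read at a sector boundary qualifies...
assert dio_eligible(0, SECTOR)
assert dio_eligible(4 * SECTOR, 2 * SECTOR)
# ...but an odd offset or a short read falls back to the buffer cache.
assert not dio_eligible(100, SECTOR)
assert not dio_eligible(0, 300)

# The flag itself is platform-dependent, so guard its use.
if hasattr(os, "O_DIRECT"):
    flags = os.O_RDONLY | os.O_DIRECT
```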
> Will PostgreSQL pick this up automatically, or do we need to add
> extra checks?
Extra checks, though I'm not sure why you'd want this. This is the
equivalent of a nice way of handling raw I/O for read-only
operations... which would be bad. Call me crazy, but unless you're on
an embedded device with a nonexistent FS cache, or you're Oracle and
handle the I/O buffering in user space, the buffer cache is what helps
speed up PostgreSQL, since PostgreSQL leaves all of the caching
operations and optimization/alignment of pages up to the OS (much to
the chagrin of madvise(), which could potentially speed up access through
the VM, but I won't get into that... see TODO.mmap).
-sc
--
Sean Chittenden
> Extra checks, though I'm not sure why you'd want this.
The reason I mention it is that Postgres already supports O_DIRECT,
I think, on some other platforms (for whatever reason).
Chris
On Tue, 17 Jun 2003, Christopher Kings-Lynne wrote:
"A new DIRECTIO kernel option enables support for read operations that
bypass the buffer cache and put data directly into a userland buffer....Will PostgreSQL pick this up automatically, or do we need to add extra
checks?
You don't want it to. It's more efficient just to use mmap, because then
all the paging and caching issues are taken care of for you.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
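Curt's mmap point can be sketched like so (illustrative only; the temp file and its contents are made up for the demo). The kernel, not the application, decides what stays resident:

```python
# Sketch: map a file and let the kernel handle all paging and caching;
# reads become plain memory accesses, no explicit read() calls.
import mmap
import os
import tempfile

# Build a small scratch file to map.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello from the page cache")
os.close(fd)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # Touching the mapping faults pages in on demand; the kernel
        # decides what to keep cached and what to evict.
        assert m[:5] == b"hello"
        data = bytes(m)

os.unlink(path)
```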
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> The reason I mention it is that Postgres already supports O_DIRECT I
> think on some other platforms (for whatever reason).
[ sounds of grepping... ] No. The only occurrence of O_DIRECT in the
source tree is in TODO:
* Consider use of open/fcntl(O_DIRECT) to minimize OS caching
I personally disagree with this TODO item for the same reason Sean
cited: Postgres is designed and tuned to rely on OS-level disk caching,
and bypassing that cache is far more likely to hurt our performance than
help it.
However, if someone wants to do some experimentation with O_DIRECT, I'd
be as interested as anyone to find out what happens...
regards, tom lane
On Tue, 17 Jun 2003, Tom Lane wrote:
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
The reason I mention it is that Postgres already supports O_DIRECT I think
on some other platforms (for whatever reason).[ sounds of grepping... ] No. The only occurrence of O_DIRECT in the
source tree is in TODO:* Consider use of open/fcntl(O_DIRECT) to minimize OS caching
I personally disagree with this TODO item for the same reason Sean
cited: Postgres is designed and tuned to rely on OS-level disk caching,
and bypassing that cache is far more likely to hurt our performance than
help it.
DB2 and Oracle, from memory, allow users to pass hints to the planner to
use/not use file system caching. This could be useful if you had an
application retrieving a large amount of data on an ad hoc basis. The large
retrieval would empty out the disk cache, thereby negatively impacting
other applications operating on data which should be cached.
Gavin
On Wed, Jun 18, 2003 at 10:01:37AM +1000, Gavin Sherry wrote:
On Tue, 17 Jun 2003, Tom Lane wrote:
> > * Consider use of open/fcntl(O_DIRECT) to minimize OS caching
> > I personally disagree with this TODO item for the same reason Sean
> > cited: Postgres is designed and tuned to rely on OS-level disk caching,
> > and bypassing that cache is far more likely to hurt our performance than
> > help it.
> DB2 and Oracle, from memory, allow users to pass hints to the planner to
> use/not use file system caching. This could be useful if you had an
> application retrieving a large amount of data on an ad hoc basis. The
> large retrieval would empty out the disk cache, thereby negatively
> impacting other applications operating on data which should be cached.
Might it make sense to do this for on-disk sorts, since sort_mem is
essentially being used as a disk cache (at least for reads)?
--
Jim C. Nasby (aka Decibel!) jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <jim@nasby.net> writes:
> > DB2 and Oracle, from memory, allow users to pass hints to the planner to
> > use/not use file system caching.
> Might it make sense to do this for on-disk sorts, since sort_mem is
> essentially being used as a disk cache (at least for reads)?
If sort_mem were actually being used that way, it might be ... but it
isn't, and so I doubt O_DIRECT would be an improvement. It seems more
likely to force disk I/O that otherwise might not happen at all, if the
kernel happens to have sufficient buffer space on hand.
I'll concede though that a large sort would probably have the effect of
blowing out the kernel's disk cache. So while O_DIRECT might be a net
pessimization as far as the sort itself is concerned, it would probably
be more friendly to the rest of the system, by leaving disk buffers free
for more productive uses. It'd all depend on your workload ...
regards, tom lane
Also, keep in mind writes to O_DIRECT devices have to wait for the data
to get on the platters rather than into the kernel cache.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Wed, Jun 18, 2003 at 10:01:37AM +1000, Gavin Sherry wrote:
> DB2 and Oracle, from memory, allow users to pass hints to the planner to
> use/not use file system caching. This could be useful if you had an
> application retrieving a large amount of data on an ad hoc basis. The
> large retrieval would empty out the disk cache, thereby negatively
> impacting other applications operating on data which should be cached.
I've recently been bitten by this. On DB2, I could change which
bufferpool the large tables were using and set it fairly small, but
that's obviously not an option with PGSQL. But if pgsql could stop caching
from occurring on user-specified queries, large table or index scans,
etc., it would be very helpful.
> I've recently been bitten by this. On DB2, I could change what
> bufferpool the large tables were using and set it fairly small, but
> obviously not an option with PGSQL. But, if pgsql could stop caching
> from occurring on user-specified queries, large table or index scans,
> etc., it would be very helpful.
Actually, now that I think about this, if the planner is going to read
more than X number of bytes as specified in a GUC, it would be useful
to have the fd marked as O_DIRECT to avoid polluting the disk
cache... I have a few tables with about 300M rows (~9GB on disk) that
I have to perform nightly seq scans over for reports and it does wipe
out some of the other fast movers that come through and depend on the
disk cache to be there for their speed. Because they're performed in
the middle of the night, I don't care that much, but my avg query
times during that period of time are slower... whether it's load or
the disk buffer being emptied and having to be refilled, I'm not sure,
but thinking about it, use of a GUC threshold to have an FD marked as
O_DIRECT does make sense (0 == disabled and the default, but tunable
in Kbytes as an admin sees fit) and could be nice for big queries that
have lots of smaller queries running around at the same time.
-sc
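Sean's threshold idea could look something like the following sketch. The name `use_o_direct` and the kilobyte-denominated threshold are hypothetical; nothing like this exists in PostgreSQL:

```python
# Sketch of a GUC-style threshold (0 == disabled, the default) that
# decides whether a scan's descriptor should be flagged O_DIRECT,
# based on the planner's estimate of how much it will read.

def use_o_direct(estimated_read_bytes: int, threshold_kbytes: int) -> bool:
    """True if the estimated read size crosses the admin's threshold."""
    if threshold_kbytes == 0:          # 0 == disabled, the default
        return False
    return estimated_read_bytes > threshold_kbytes * 1024

# A ~9 GB nightly seq scan crosses a 64 MB threshold...
assert use_o_direct(9 * 1024**3, 64 * 1024)
# ...small OLTP reads do not, and a threshold of 0 never triggers.
assert not use_o_direct(8192, 64 * 1024)
assert not use_o_direct(9 * 1024**3, 0)
```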
What you really want is Solaris's free-behind, where it detects if a
scan is exceeding a certain percentage of the OS cache and moves the
pages to the _front_ of the to-be-reused list. I am not sure what other
OSes support this, but we need this in our own buffer manager code as
well.
Our TODO already has:
* Add free-behind capability for large sequential scans (Bruce)
Basically, I think we need free-behind rather than O_DIRECT.
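A toy model of the free-behind idea (illustrative only; this is not the PostgreSQL buffer manager, and the `Cache` class is invented for the demo). Scan pages go to the reclaim end of the LRU instead of the hot end, so a huge scan can never evict the working set:

```python
# Sketch of free-behind: pages touched by a large sequential scan are
# inserted at the reclaim end of the LRU instead of the recently-used
# end, so the scan cannot flush the whole cache.
from collections import deque

class Cache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lru = deque()            # left end = next to be reclaimed

    def access(self, page, seq_scan: bool = False):
        if page in self.lru:
            self.lru.remove(page)
        elif len(self.lru) >= self.capacity:
            self.lru.popleft()        # reclaim the coldest page
        if seq_scan:
            self.lru.appendleft(page) # free-behind: first to be reused
        else:
            self.lru.append(page)     # normal: most recently used

# With free-behind, a 100-page scan churns only one reclaim slot.
cache = Cache(capacity=3)
for p in ("a", "b", "c"):
    cache.access(p)                           # hot working set
for p in range(100):
    cache.access(("scan", p), seq_scan=True)  # huge seq scan
assert "b" in cache.lru and "c" in cache.lru

# Without it, the same scan evicts the entire working set.
cold = Cache(capacity=3)
for p in ("a", "b", "c"):
    cold.access(p)
for p in range(100):
    cold.access(("scan", p))
assert "b" not in cold.lru
```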
> What you really want is Solaris's free-behind, where it detects if a
> scan is exceeding a certain percentage of the OS cache and moves the
> pages to the _front_ of the to-be-reused list. I am not sure what
> other OSes support this, but we need this in our own buffer manager
> code as well.
>
> Our TODO already has:
> * Add free-behind capability for large sequential scans (Bruce)
>
> Basically, I think we need free-behind rather than O_DIRECT.
I suppose, but you've already polluted the cache by the time the above
mentioned mechanism kicks in and takes effect. Given that the planner
has an idea of how much data it's going to read in in order to
complete the query, seems easier/better to mark the fd O_DIRECT.
*shrug*
-sc
Sean Chittenden wrote:
> I suppose, but you've already polluted the cache by the time the above
> mentioned mechanism kicks in and takes effect. Given that the planner
> has an idea of how much data it's going to read in in order to
> complete the query, seems easier/better to mark the fd O_DIRECT.
> *shrug*
_That_ is an excellent point. However, do we know at the time we open
the file descriptor if we will be doing this? What about cache
coherency problems with other backends not opening with O_DIRECT? And
finally, how do we deal with the fact that writes to O_DIRECT files will
wait until the data hits the disk because there is no kernel buffer cache?
> _That_ is an excellent point. However, do we know at the time we
> open the file descriptor if we will be doing this?
Doesn't matter, it's an option to fcntl().
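That fcntl() point can be sketched like this (guarded, since O_DIRECT is platform-dependent and some filesystems reject it; the temp file is just scaffolding for the demo):

```python
# Sketch: O_DIRECT can be flipped on an already-open descriptor with
# fcntl(F_SETFL), so the decision can wait until the planner knows it
# is about to do a big scan.
import fcntl
import os
import tempfile

fd, path = tempfile.mkstemp()
flags_before = fcntl.fcntl(fd, fcntl.F_GETFL)

if hasattr(os, "O_DIRECT"):
    try:
        fcntl.fcntl(fd, fcntl.F_SETFL, flags_before | os.O_DIRECT)
        # ... run the large sequential scan here ...
        fcntl.fcntl(fd, fcntl.F_SETFL, flags_before)  # back to cached I/O
    except OSError:
        pass  # this filesystem does not support direct I/O

# Either way, the descriptor ends up back in its original mode.
assert fcntl.fcntl(fd, fcntl.F_GETFL) == flags_before
os.close(fd)
os.unlink(path)
```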
> What about cache coherency problems with other backends not opening
> with O_DIRECT?
That's a problem for the kernel VM, if you mean cache coherency in the
VM. If you mean inside of the backend, that could be a stickier
issue, I think. I don't know enough of the internals yet to know if
this is a problem or not, but you're right, it's certainly something
to consider. Is the cache a write behind cache or is it a read
through cache? If it's a read through cache, which I think it is,
then the backend would have to dirty all cache entries pertaining to
the relations being opened with O_DIRECT. The use case for that
being:
1) a transaction begins
2) a few rows out of the huge table are read
3) a huge query is performed that triggers the use of O_DIRECT
4) the rows selected in step 2 are updated (this step should poison or
update the cache, actually, and act as a write through cache if the
data is in the cache)
5) the same few rows are read in again
6) transaction is committed
Provided the cache is poisoned or updated in step 4, I can't see how
or where this would be a problem. Please enlighten if there's a
different case that would need to be taken into account. I can't
imagine ever wanting to write out data using O_DIRECT; I think
it's a read-only optimization, an attempt to minimize the turnover
in the OS's cache. From fcntl(2):
O_DIRECT   Minimize or eliminate the cache effects of reading and
           writing. The system will attempt to avoid caching the
           data you read or write. If it cannot avoid caching the
           data, it will minimize the impact the data has on the
           cache. Use of this flag can drastically reduce
           performance if not used with care.
> And finally, how do we deal with the fact that writes to O_DIRECT
> files will wait until the data hits the disk because there is no
> kernel buffer cache?
Well, two things.
1) O_DIRECT should never be used on writes... I can't think of a case
where you'd want it on: even when COPY'ing data and restoring a
DB, it just doesn't make sense to use it. The write buffer is
emptied as soon as the pages hit the disk unless something is
reading those bits, but I'd imagine the write buffer would be used
to make sure that as much writing is done to the platter in a
single write by the OS as possible, circumventing that would be
insane (though useful possibly for embedded devices with low RAM,
solid state drives, or some super nice EMC fiber channel storage
device that basically has its own huge disk cache).
2) Last I checked PostgreSQL wasn't a threaded app and doesn't use
non-blocking IO. The backend would block until the call returns,
where's the problem? :)
If anything O_DIRECT would shake out any bugs in PostgreSQL's caching
code, if there are any. -sc
Basically, we don't know when we read a buffer whether this is a
read-only or read/write. In fact, we could read it in, and another
backend could write it for us.
The big issue is that when we do a write, we don't wait for it to get to
disk.
It seems that to use O_DIRECT, we would have to read the buffer in a special
way to mark it as read-only, which seems kind of strange. I see no
reason we can't use free-behind in the PostgreSQL buffer cache to handle
most of the benefits of O_DIRECT, without the read-only buffer restriction.
> Basically, we don't know when we read a buffer whether this is a
> read-only or read/write. In fact, we could read it in, and another
> backend could write it for us.
Um, wait. The cache is shared between backends? I don't think so,
but it shouldn't matter because there has to be a semaphore locking
the cache to prevent the coherency issue you describe. If PostgreSQL
didn't, it'd be having problems with this now. I'd also think that
MVCC would handle the case of updated data in the cache as that has to
be a common case. At what point is the cached result invalidated and
fetched from the OS?
> The big issue is that when we do a write, we don't wait for it to get to
> disk.
Only in the case when fsync() is turned off, but again, that's up to
the OS to manage that can of worms, which I think BSD takes care of.
From conf/NOTES:
# Attempt to bypass the buffer cache and put data directly into the
# userland buffer for read operation when O_DIRECT flag is set on the
# file. Both offset and length of the read operation must be
# multiples of the physical media sector size.
#
#options DIRECTIO
The offsets and length bit kinda bothers me though, but I think that's
stuff that the kernel has to take into account, not the userland calls.
I wonder if that's actually accurate any more or affects userland
calls... seems like that'd be a bit too much work to have the user
do, especially given the lack of documentation on the flag... it should
just be a drop-in additional flag, AFAICT.
> It seems that to use O_DIRECT, we would have to read the buffer in a
> special way to mark it as read-only, which seems kind of strange. I
> see no reason we can't use free-behind in the PostgreSQL buffer
> cache to handle most of the benefits of O_DIRECT, without the
> read-only buffer restriction.
I don't see how this'd be an issue as buffers populated via a read(),
that are updated, and then written out, would occupy a new chunk of
disk to satisfy MVCC. Why would we need to mark a buffer as read only
and carry around/check its state?
-sc
Sean Chittenden wrote:
> Um, wait. The cache is shared between backends? I don't think so,
> but it shouldn't matter because there has to be a semaphore locking
> the cache to prevent the coherency issue you describe. If PostgreSQL
> didn't, it'd be having problems with this now. I'd also think that
> MVCC would handle the case of updated data in the cache as that has to
> be a common case. At what point is the cached result invalidated and
> fetched from the OS?
Uh, it's called the _shared_ buffer cache in postgresql.conf, and we
lock pages only while we are reading/writing them, not for the duration
they are in the cache.
> > The big issue is that when we do a write, we don't wait for it to
> > get to disk.
> Only in the case when fsync() is turned off, but again, that's up to
> the OS to manage that can of worms, which I think BSD takes care of.
> From conf/NOTES:
Nope. When you don't have a kernel buffer cache, and you do a write,
where do you expect it to go? I assume it goes to the drive, and you
have to wait for that.
> I don't see how this'd be an issue as buffers populated via a read(),
> that are updated, and then written out, would occupy a new chunk of
> disk to satisfy MVCC. Why would we need to mark a buffer as read only
> and carry around/check its state?
We update the expired flags on the tuple during update/delete.
> Uh, it's called the _shared_ buffer cache in postgresql.conf, and we
> lock pages only while we are reading/writing them, not for the duration
> they are in the cache.
*smacks forehead* Duh, you're right. I always just turn up the FS
cache in the OS instead.
The shared buffer cache has got to have enormous churn though if
everything ends up in the userland cache. Is it really an exhaustive
cache? I thought the bulk of the caching happened in the kernel and
not in the userland. Is the userland cache just for the SysCache and
friends, or does it cache everything that moves through PostgreSQL?
> Nope. When you don't have a kernel buffer cache, and you do a
> write, where do you expect it to go? I assume it goes to the drive,
> and you have to wait for that.
Correct, a write call blocks until the bits hit the disk in the
absence of enough buffer space. Given enough buffer, however, the
buffer houses the bits until they are written to disk and the kernel
returns control to the userland app.
Consensus is that FreeBSD does the right thing and hands back data
from the FS buffer even though the fd was marked O_DIRECT (see
bottom).
> We update the expired flags on the tuple during update/delete.
*nods* Okay, I don't see where the problem would be then with
O_DIRECT. I'm going to ask Dillon about O_DIRECT since he
implemented it, likely for the backplane database that he's writing.
I'll let 'ya know what he says.
-sc
Here's a snip from the conv I had with someone that has mega vfs foo
in FreeBSD:
17:58 * seanc has a question about O_DIRECT
17:58 <@zb^3> ask
17:59 <@seanc> assume two procs have a file open, one proc writes using
buffered IO, the other uses O_DIRECT to read from the file, is
read() smart enough to hand back the data in the buffer that
hasn't hit the disk yet or will there be syncing issues?
18:00 <@zb^3> O_DIRECT in the incarnation from matt dillon will break shit
18:00 <@zb^3> basically, any data read will be set non-cacheable
18:01 <@zb^3> and you'll experience writes earlier than you should
18:01 <@seanc> zb^3: hrm, I don't want to write to the fd + O_DIRECT though
18:02 <@seanc> zb^3: basically you're saying an O_DIRECT fd doesn't consult the
FS cache before reading from disk?
18:03 <@zb^3> no, it does
18:03 <@zb^3> but it immediately puts any read blocks on the ass end of the LRU
18:03 <@zb^3> so if you write a block, then read it with O_DIRECT it will get
written out early :(
18:04 <@seanc> zb^3: ah, got it... it's not a data coherency issue, it's a
priority issue and O_DIRECT makes writes jump the gun
18:04 <@seanc> got it
18:05 <@seanc> zb^3: is that required in the implementation or is it a bug?
18:06 * seanc is wondering whether or not he should bug dillion about this to
get things working correctly
18:07 <@zb^3> it's a bug in the implementation
18:08 <@zb^3> to fix it you have to pass flags all the way down into the
getblk-like layer
18:08 <@zb^3> and dillon was opposed to that
18:09 <@seanc> zb^3: hrm, thx... I'll go bug him about it now and see what's up
in backplane land
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Basically, I think we need free-behind rather than O_DIRECT.
There are two separate issues here --- one is what's happening in our
own cache, and one is what's happening in the kernel disk cache.
Implementing our own free-behind code would help in our own cache but
does nothing for the kernel cache.
My thought on this is that for large seqscans we could think about
doing reads through a file descriptor that's opened with O_DIRECT.
But writes should never go through O_DIRECT. In some scenarios this
would mean having two FDs open for the same relation file. This'd
require moderately extensive changes to the smgr-related APIs, but
it doesn't seem totally out of the question. I'd kinda like to see
some experimental evidence that it's worth doing though. Anyone
care to make a quick-hack prototype and do some measurements?
regards, tom lane
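Tom's two-descriptor idea might be prototyped roughly as below. The `RelationFile` class and its method names are hypothetical, nothing to do with the real smgr API; the point is only that reads for big scans and ordinary writes can go through different descriptors on the same file:

```python
# Sketch: two descriptors per relation file -- a normal one for writes
# and ordinary reads (which stay in the kernel cache), and a second,
# O_DIRECT-flagged one used only for large sequential scans.
import os
import tempfile

class RelationFile:
    def __init__(self, path: str):
        self.path = path
        self.rw_fd = os.open(path, os.O_RDWR)  # writes stay cached
        self.scan_fd = None                    # opened lazily

    def seqscan_fd(self) -> int:
        """Descriptor for big scans; O_DIRECT where the OS offers it."""
        if self.scan_fd is None:
            flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)
            try:
                self.scan_fd = os.open(self.path, flags)
            except OSError:                    # no direct-I/O support here
                self.scan_fd = os.open(self.path, os.O_RDONLY)
        return self.scan_fd

    def close(self):
        os.close(self.rw_fd)
        if self.scan_fd is not None:
            os.close(self.scan_fd)

fd, path = tempfile.mkstemp()
os.close(fd)
rel = RelationFile(path)
# The two descriptors are distinct, so scan reads could bypass the
# cache without ever making writes synchronous.
assert rel.seqscan_fd() != rel.rw_fd
rel.close()
os.unlink(path)
```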