Safe/Fast I/O ...
Has anyone looked into this? I'm just getting ready to download it and
play with it, see what's involved in using it. From what I can see, it's
essentially an optimized stdio library...
URL is at: http://www.research.att.com/sw/tools/sfio
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> Has anyone looked into this? I'm just getting ready to download it and
> play with it, see what's involved in using it. From what I can see, it's
> essentially an optimized stdio library...
> URL is at: http://www.research.att.com/sw/tools/sfio
Using mmap and/or AIO would be better...
FreeBSD and Solaris support AIO I believe. Given past trends Linux will
as well.
/*
Matthew N. Dodd | A memory retaining a love you had for life
winter@jurai.net | As cruel as it seems nothing ever seems to
http://www.jurai.net/~winter | go right - FLA M 3.1:53
*/
On Sun, 12 Apr 1998, Matthew N. Dodd wrote:
> On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> > Has anyone looked into this? I'm just getting ready to download it and
> > play with it, see what's involved in using it. From what I can see, it's
> > essentially an optimized stdio library...
> > URL is at: http://www.research.att.com/sw/tools/sfio
> Using mmap and/or AIO would be better...
> FreeBSD and Solaris support AIO I believe. Given past trends Linux will
> as well.
I hate to have to ask, but how is MMAP or AIO better than sfio? I
haven't had enough time to research any of this, and am just starting to
look at it...
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> I hate to have to ask, but how is MMAP or AIO better than sfio? I
> haven't had enough time to research any of this, and am just starting to
> look at it...
If it's simple to compile, works as a drop-in replacement, AND is faster,
I see no reason why PostgreSQL shouldn't try to link with it.
Keep in mind though that in order to use MMAP or AIO you'd be
restructuring the code to be more efficient rather than doing more of the
same old thing but optimized.
Only testing will prove me right or wrong though. :)
/*
Matthew N. Dodd | A memory retaining a love you had for life
winter@jurai.net | As cruel as it seems nothing ever seems to
http://www.jurai.net/~winter | go right - FLA M 3.1:53
*/
On Sun, 12 Apr 1998, Matthew N. Dodd wrote:
> On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> > I hate to have to ask, but how is MMAP or AIO better than sfio? I
> > haven't had enough time to research any of this, and am just starting to
> > look at it...
> If it's simple to compile, works as a drop-in replacement, AND is faster,
> I see no reason why PostgreSQL shouldn't try to link with it.
That didn't really answer the question :(
> Keep in mind though that in order to use MMAP or AIO you'd be
> restructuring the code to be more efficient rather than doing more of the
> same old thing but optimized.
I don't know anything about AIO, so if you can give me a pointer
to where I can read up on it, please do...
...but, with MMAP, unless I'm mistaken, you'd essentially be
reading the file(s) into memory and then manipulating the file(s) there.
Which means one helluva large amount of RAM being required...no?
Using stdio vs sfio to read a 1.2 million line file, the time to
complete goes from 7sec to 5sec ... that makes for substantial savings
in time, if it's applicable.
the problem, as I see it right now, is the docs for it suck ...
so, right now, I'm fumbling through figuring it all out :)
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Marc G. Fournier wrote:
> On Sun, 12 Apr 1998, Matthew N. Dodd wrote:
> > On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> > > I hate to have to ask, but how is MMAP or AIO better than sfio? I
> > > haven't had enough time to research any of this, and am just starting to
> > > look at it...
> > If it's simple to compile, works as a drop-in replacement, AND is faster,
> > I see no reason why PostgreSQL shouldn't try to link with it.
> That didn't really answer the question :(
> > Keep in mind though that in order to use MMAP or AIO you'd be
> > restructuring the code to be more efficient rather than doing more of the
> > same old thing but optimized.
> I don't know anything about AIO, so if you can give me a pointer
> to where I can read up on it, please do...
> ...but, with MMAP, unless I'm mistaken, you'd essentially be
> reading the file(s) into memory and then manipulating the file(s) there.
> Which means one helluva large amount of RAM being required...no?
> Using stdio vs sfio to read a 1.2 million line file, the time to
> complete goes from 7sec to 5sec ... that makes for substantial savings
> in time, if it's applicable.
> the problem, as I see it right now, is the docs for it suck ...
> so, right now, I'm fumbling through figuring it all out :)
One of the options when building perl5 is to use sfio instead of stdio. I
haven't tried it, but they seem to think it works.
That said, the only place I see this helping pgsql is in copyin and copyout,
as these use the stdio interfaces: fread(), fwrite(), etc.
Everywhere else we use the system-call IO interfaces: read(), write(),
recv(), send(), select(), etc., and do our own buffering.
My prediction is that sfio vs stdio will have an undetectable impact
on sql performance and only very minor impact on copyin and copyout (as
most of the overhead is in pgsql, not libc).
As far as IO goes, the problem we have is fsync(). To get rid of it means
doing a real write-ahead log system and (maybe) aio to the log. As soon as
we get real logging, we don't need to force data pages out, so we can get
rid of all the fsyncs and (given how slow we are otherwise) completely
eliminate IO as a bottleneck.
Pgsql was built for comfort, not for speed. Fine tuning, code
tweaking, and micro-optimization are fine as far as they go. But there is
probably a maximum 2x speedup to be had that way. Total.
We need a 10x speedup to play with serious databases. This will take real
architectural changes.
If you are interested in what is necessary, I highly recommend the book
"Transaction Processing" by Jim Gray (and someone whose name escapes me
just now). It is a great big thing and will take a while to get through, but
it is decently written and very well worth the time. It pretty much gives
away the whole candy store as far as building high performance, reliable,
and scalable database and TP systems. I wish it had been available 10
years ago when I got into the DB game.
-dg
David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
- Linux. Not because it is free. Because it is better.
> On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> > I hate to have to ask, but how is MMAP or AIO better than sfio? I
> > haven't had enough time to research any of this, and am just starting to
> > look at it...
> If it's simple to compile, works as a drop-in replacement, AND is faster,
> I see no reason why PostgreSQL shouldn't try to link with it.
> Keep in mind though that in order to use MMAP or AIO you'd be
> restructuring the code to be more efficient rather than doing more of the
> same old thing but optimized.
> Only testing will prove me right or wrong though. :)
As David Gould mentioned, we need to do pre-fetching of data pages
somehow.
When doing a sequential scan on a table, the OS is doing a one-page
prefetch, which is probably enough. The problem is index scans of the
table. Those are not sequential in the main heap table (unless it is
clustered on the index), so a prefetch would help here a lot.
That is where we need async i/o. I am looking in BSDI, and I don't see
any way to do async i/o. The only way I can think of doing it is via
threads.
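A crude sketch of the separate-process variant (an illustration only,
nothing like this exists in the tree; BLCKSZ and prefetch_blocks() are
invented names): fork a child that reads the blocks the index scan is
about to visit, purely to pull them into the OS buffer cache.

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Hypothetical prefetch helper: the child re-reads upcoming heap
 * blocks and discards them; the only effect is warming the OS cache.
 */
static void
prefetch_blocks(const char *heapfile, const long *blocks, int nblocks)
{
    if (fork() == 0)                    /* child does the reads and exits */
    {
        char buf[BLCKSZ];
        int  fd = open(heapfile, O_RDONLY);

        if (fd >= 0)
        {
            int i;

            for (i = 0; i < nblocks; i++)
            {
                lseek(fd, blocks[i] * (off_t) BLCKSZ, SEEK_SET);
                (void) read(fd, buf, BLCKSZ);   /* data discarded */
            }
            close(fd);
        }
        _exit(0);
    }
    /* parent returns at once and proceeds with the scan */
}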
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
> If you are interested in what is necessary, I highly recommend the book
> "Transaction Processing" by Jim Gray (and someone whose name escapes me
> just now). It is a great big thing and will take a while to get through, but
> it is decently written and very well worth the time. It pretty much gives
> away the whole candy store as far as building high performance, reliable,
> and scalable database and TP systems. I wish it had been available 10
> years ago when I got into the DB game.
David is 100% correct here. We need a major overhaul.
He is also 100% correct about the book he is recommending. I got it
last week, and was going to make a big pitch for this, but now that he
has mentioned it again, let me support it. His quote:
It pretty much gives away the whole candy store...
is right on the mark. This book is big, and meaty. Date has it listed
in his bibliography, and says:
If any computer science text ever deserved the epithet "instant
classic," it is surely this one. Its size is daunting at first (over
1000 pages), but the authors display an enviable lightness of touch that
makes even the driest aspects of the subject enjoyable reading. In
their preface, they state their intent as being "to help...solve real
problems"; the book is "pragmatic, covering basic transaction issues in
considerable detail"; and the presentation "is full of code fragments
showing...basic algorithm and data structures" and is not
"encyclopedic." Despite this last claim, the book is (not surprisingly)
comprehensive, and is surely destined to become the standard work.
Strongly recommended.
What more can I say? I will add this book recommendation to
tools/FAQ_DEV. The book is not cheap, at ~$90.
The book is "Transaction Processing: Concepts and Techniques," by Jim
Gray and Andreas Reuter, Morgan Kaufmann publishers, ISBN 1-55860-190-2.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
> As David Gould mentioned, we need to do pre-fetching of data pages
> somehow.
> When doing a sequential scan on a table, the OS is doing a one-page
> prefetch, which is probably enough. The problem is index scans of the
> table. Those are not sequential in the main heap table (unless it is
> clustered on the index), so a prefetch would help here a lot.
> That is where we need async i/o. I am looking in BSDI, and I don't see
> any way to do async i/o. The only way I can think of doing it is via
> threads.
I found it. It is an fcntl option. From man fcntl:
O_ASYNC Enable the SIGIO signal to be sent to the process group when
I/O is possible, e.g., upon availability of data to be read.
Who else supports this?
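For reference, turning the flag on would look roughly like this (a sketch
only, and with the caveat raised below that on several systems O_ASYNC
works for ttys and sockets but not regular files; enable_sigio() is an
invented name):

#include <fcntl.h>
#include <unistd.h>

/* Ask for SIGIO on fd; returns -1 on failure. */
static int
enable_sigio(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    if (fcntl(fd, F_SETOWN, getpid()) < 0)  /* deliver SIGIO to us */
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC);
}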
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
> When doing a sequential scan on a table, the OS is doing a one-page
> prefetch, which is probably enough. The problem is index scans of the
> table. Those are not sequential in the main heap table (unless it is
> clustered on the index), so a prefetch would help here a lot.
> That is where we need async i/o. I am looking in BSDI, and I don't see
> any way to do async i/o. The only way I can think of doing it is via
> threads.
> O_ASYNC Enable the SIGIO signal to be sent to the process group when
> I/O is possible, e.g., upon availability of data to be read.
Now I am questioning this. I am not sure this is actually for file i/o, or
only tty i/o.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
On Sun, 12 Apr 1998, Bruce Momjian wrote:
> As David Gould mentioned, we need to do pre-fetching of data pages
> somehow.
> When doing a sequential scan on a table, the OS is doing a one-page
> prefetch, which is probably enough. The problem is index scans of the
> table. Those are not sequential in the main heap table (unless it is
> clustered on the index), so a prefetch would help here a lot.
> That is where we need async i/o. I am looking in BSDI, and I don't see
> any way to do async i/o. The only way I can think of doing it is via
> threads.
> I found it. It is an fcntl option. From man fcntl:
> O_ASYNC Enable the SIGIO signal to be sent to the process group when
> I/O is possible, e.g., upon availability of data to be read.
> Who else supports this?
FreeBSD...
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
> When doing a sequential scan on a table, the OS is doing a one-page
> prefetch, which is probably enough. The problem is index scans of the
> table. Those are not sequential in the main heap table (unless it is
> clustered on the index), so a prefetch would help here a lot.
> That is where we need async i/o. I am looking in BSDI, and I don't see
> any way to do async i/o. The only way I can think of doing it is via
> threads.
> O_ASYNC Enable the SIGIO signal to be sent to the process group when
> I/O is possible, e.g., upon availability of data to be read.
> Now I am questioning this. I am not sure this is actually for file i/o,
> or only tty i/o.
async file calls:
aio_cancel
aio_error
aio_read
aio_return -- gets status of pending io call
aio_suspend
aio_write
And yes the Gray book is great!
Jordan Henderson
On Sun, 12 Apr 1998, Bruce Momjian wrote:
> I found it. It is an fcntl option. From man fcntl:
> O_ASYNC Enable the SIGIO signal to be sent to the process group when
> I/O is possible, e.g., upon availability of data to be read.
> Who else supports this?
FreeBSD and NetBSD appear to.
Linux and Solaris appear not to.
I was really speaking of the POSIX 1003.1B AIO/LIO calls when I originally
brought this up. (aio_read/aio_write)
/*
Matthew N. Dodd | A memory retaining a love you had for life
winter@jurai.net | As cruel as it seems nothing ever seems to
http://www.jurai.net/~winter | go right - FLA M 3.1:53
*/
> async file calls:
> aio_cancel
> aio_error
> aio_read
> aio_return -- gets status of pending io call
> aio_suspend
> aio_write
Can you elaborate on this? Does it cause a read() to return right away,
and signal when data is ready?
> And yes the Gray book is great!
> Jordan Henderson
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
> async file calls:
> aio_cancel
> aio_error
> aio_read
> aio_return -- gets status of pending io call
> aio_suspend
> aio_write
> Can you elaborate on this? Does it cause a read() to return right away,
> and signal when data is ready?
These are posix calls. Many systems support them and they are fairly easy
to emulate (with threads or io processes) on systems that don't. If we
are going to do Async IO, I suggest that we code to the posix interface and
build emulators for the systems that don't have the posix calls.
I think there is an implementation of this for Linux, but it is a separate
package, not part of the base system as far as I know. Of course with Linux
anything you know it didn't do two weeks ago, it will do next week...
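To make the posix interface concrete, here is a minimal sketch
(read_block_async() and BLCKSZ are invented for the example; link with
-lposix4 on Solaris or -lrt elsewhere): queue a read, overlap some work,
then wait for and harvest the result.

#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

#define BLCKSZ 8192

static ssize_t
read_block_async(int fd, long blkno, char *buf)
{
    struct aiocb cb;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = BLCKSZ;
    cb.aio_offset = blkno * (off_t) BLCKSZ;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* we poll; no signal */

    if (aio_read(&cb) < 0)                      /* queue it and return */
        return -1;

    /* ... overlap useful work here, e.g. process the previous page ... */

    while (aio_error(&cb) == EINPROGRESS)       /* still in flight? */
    {
        const struct aiocb *list[1];

        list[0] = &cb;
        aio_suspend(list, 1, NULL);             /* sleep until it finishes */
    }
    return aio_return(&cb);                     /* bytes read, or -1 */
}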
Here is the Solaris man page for aio_read() and aio_write:
-dg
-----------------------------------------------------------------------------
aio_read(3R) Realtime Library aio_read(3R)
NAME
aio_read, aio_write - asynchronous read and write operations
SYNOPSIS
cc [ flag ... ] file ... -lposix4 [ library ... ]
#include <aio.h>
int aio_read(struct aiocb *aiocbp);
int aio_write(struct aiocb *aiocbp);
struct aiocb {
    int             aio_fildes;     /* file descriptor */
    volatile void   *aio_buf;       /* buffer location */
    size_t          aio_nbytes;     /* length of transfer */
    off_t           aio_offset;     /* file offset */
    int             aio_reqprio;    /* request priority offset */
    struct sigevent aio_sigevent;   /* signal number and offset */
    int             aio_lio_opcode; /* listio operation */
};
struct sigevent {
int sigev_notify; /* notification mode */
int sigev_signo; /* signal number */
union sigval sigev_value; /* signal value */
};
union sigval {
int sival_int; /* integer value */
void *sival_ptr; /* pointer value */
};
MT-LEVEL
MT-Safe
DESCRIPTION
aio_read() queues an asynchronous read request, and returns
control immediately. Rather than blocking until completion,
the read operation continues concurrently with other
activity of the process.
Upon enqueuing the request, the calling process reads
aiocbp->nbytes from the file referred to by aiocbp->fildes
into the buffer pointed to by aiocbp->aio_buf.
aiocbp->offset marks the absolute position from the beginning
of the file (in bytes) at which the read begins.
aio_write() queues an asynchronous write request, and
returns control immediately. Rather than blocking until
completion, the write operation continues concurrently with
other activity of the process.
Upon enqueuing the request, the calling process writes
aiocbp->nbytes from the buffer pointed to by aiocbp->aio_buf
into the file referred to by aiocbp->fildes. If
O_APPEND is set for aiocbp->fildes, aio_write() operations
append to the file in the same order as the calls were made.
If O_APPEND is not set for the file descriptor, then the
write operation will occur at the absolute position from the
beginning of the file plus aiocbp->offset (in bytes).
These asynchronous operations are submitted at a priority
equal to the calling process' scheduling priority minus
aiocbp->aio_reqprio.
aiocb->aio_sigevent defines both the signal to be generated
and how the calling process will be notified upon I/O
completion. If aio_sigevent.sigev_notify is SIGEV_NONE, then
no signal will be posted upon I/O completion, but the error
status and the return status for the operation will be set
appropriately. If aio_sigevent.sigev_notify is
SIGEV_SIGNAL, then the signal specified in
aio_sigevent.sigev_signo will be sent to the process. If
the SA_SIGINFO flag is set for that signal number, then the
signal will be queued to the process and the value specified
in aio_sigevent.sigev_value will be the si_value component
of the generated signal (see siginfo(5)).
RETURN VALUES
If the I/O operation is successfully queued, aio_read() and
aio_write() return 0, otherwise, they return -1, and set
errno to indicate the error condition. aiocbp may be used
as an argument to aio_error(3R) and aio_return(3R) in order
to determine the error status and the return status of the
asynchronous operation while it is proceeding.
ERRORS
EAGAIN The requested asynchronous I/O operation was
not queued due to system resource limitations.
ENOSYS aio_read() or aio_write() is not supported by
this implementation.
EBADF If the calling function is aio_read(), and
aiocbp->fildes is not a valid file descriptor
open for reading. If the calling function is
aio_write(), and aiocbp->fildes is not a
valid file descriptor open for writing.
EINVAL The file offset value implied by aiocbp->aio_offset
would be invalid,
aiocbp->aio_reqprio is not a valid value,
or aiocbp->aio_nbytes is an invalid value.
ECANCELED The requested I/O was canceled before the I/O
completed due to an explicit aio_cancel(3R)
request.
EINVAL The file offset value implied by aiocbp->aio_offset
would be invalid.
SEE ALSO
close(2), exec(2), exit(2), fork(2), lseek(2), read(2),
write(2), aio_cancel(3R), aio_return(3R), lio_listio(3R),
siginfo(5)
NOTES
For portability, the application should set aiocb->aio_reqprio
to 0.
Applications compiled under Solaris 2.3 and 2.4 and using
POSIX aio must be recompiled to work correctly when Solaris
supports the Asynchronous Input and Output option.
BUGS
In Solaris 2.5, these functions always return -1 and set
errno to ENOSYS, because this release does not support the
Asynchronous Input and Output option. It is our intention
Bruce Momjian wrote:
> > On Sun, 12 Apr 1998, The Hermit Hacker wrote:
> > > I hate to have to ask, but how is MMAP or AIO better than sfio? I
> > > haven't had enough time to research any of this, and am just starting to
> > > look at it...
> > If it's simple to compile, works as a drop-in replacement, AND is faster,
> > I see no reason why PostgreSQL shouldn't try to link with it.
> > Keep in mind though that in order to use MMAP or AIO you'd be
> > restructuring the code to be more efficient rather than doing more of the
> > same old thing but optimized.
> > Only testing will prove me right or wrong though. :)
> As David Gould mentioned, we need to do pre-fetching of data pages
> somehow.
> When doing a sequential scan on a table, the OS is doing a one-page
> prefetch, which is probably enough. The problem is index scans of the
> table. Those are not sequential in the main heap table (unless it is
> clustered on the index), so a prefetch would help here a lot.
> That is where we need async i/o. I am looking in BSDI, and I don't see
> any way to do async i/o. The only way I can think of doing it is via
> threads.
I have heard that glibc version 2.0 will support the POSIX AIO spec.
Solaris currently has *an* implementation of AIO, but it is not the
POSIX one. This prefetch could be done in another process or thread,
rather than tying the code to a given AIO implementation.
Ocie
The Hermit Hacker wrote:
> ...but, with MMAP, unless I'm mistaken, you'd essentially be
> reading the file(s) into memory and then manipulating the file(s) there.
> Which means one helluva large amount of RAM being required...no?
Not exactly. Memory mapping only maps the file to a range of memory
addresses; it is not all pulled into memory at once. Disk sectors are
copied into memory on demand: when a mmaped page is accessed, it is
copied from disk into memory.
The main reason for using memory mapping is that you don't have to create
unnecessary buffers. Normally, for every operation you have to create
some in-memory buffer, copy the data there, do some operations, and put the
data back into the file. With memory mapping you avoid creating
unnecessary buffers, and moreover you call your system functions
less frequently. There are also additional savings (less memory
copying, and reusing memory if several processes map the same file).
I don't think more efficient solutions exist.
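As a concrete sketch (count_nuls() is an invented example): the file is
never read() into a private buffer; its pages are simply addressed, and
the kernel faults each one in from disk on first touch.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Count NUL bytes in a file with no read() calls and no buffers. */
static long
count_nuls(const char *path)
{
    struct stat st;
    char   *p;
    long    n = 0;
    off_t   i;
    int     fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0)
    {
        close(fd);
        return -1;
    }

    /* map the whole file; pages are loaded lazily, on first access */
    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
    {
        close(fd);
        return -1;
    }

    for (i = 0; i < st.st_size; i++)    /* each touch may fault in a page */
        if (p[i] == '\0')
            n++;

    munmap(p, st.st_size);
    close(fd);
    return n;
}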
Mike
--
WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND
[Please forgive me for the way this post is put together; I'm not
actually on your mailing-list, but was just perusing the archives.]
Michal Mosiewicz <mimo@interdata.com.pl> writes:
> The main reason for using memory mapping is that you don't have to create
> unnecessary buffers. Normally, for every operation you have to create
> some in-memory buffer, copy the data there, do some operations, and put the
> data back into the file. With memory mapping you avoid creating
> unnecessary buffers, and moreover you call your system functions
> less frequently. There are also additional savings (less memory
> copying, and reusing memory if several processes map the same file).
Additionally, if your operating system is at all reasonable, using
memory mapping allows you to take advantage of all the work that has
gone into tuning your VM system. If you map a large file, and then
access in some way that shows reasonable locality, the VM system will
probably be able to do a better job of page replacement on a
system-wide basis than you could do with a cache built into your
application. (A good system will also provide other benefits, such as
pre-faulting and extended read ahead.)
Of course, it does have one significant drawback: memory-mapped regions
do not automatically extend when their underlying files do. So, for
interacting with a structure that shows effectively linear access and
growth, asynchronous I/O is more likely to be a benefit, since AIO can
extend a file asynchronously, whereas other mechanisms will block
while the file is being extended. (Depending on the system, this may
not be true for multi-threaded programs.)
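One portable way to cope (a sketch; remap_grown() is an invented helper)
is to extend the file first and then rebuild the mapping, since there is
no standard "grow this mapping in place" call:

#include <sys/mman.h>
#include <unistd.h>

/*
 * Extend a mapped file to `newsize` bytes and re-establish the
 * mapping; the old mapping (if any) is discarded.
 */
static void *
remap_grown(int fd, void *old, size_t oldsize, size_t newsize)
{
    if (ftruncate(fd, (off_t) newsize) < 0)     /* grow the file itself */
        return MAP_FAILED;
    if (old != NULL)
        munmap(old, oldsize);                   /* drop the stale mapping */
    return mmap(NULL, newsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}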
-GAWollman
--
Garrett A. Wollman | O Siem / We are all family / O Siem / We're all the same
wollman@lcs.mit.edu | O Siem / The fires of freedom
Opinions not those of| Dance in the burning flame
MIT, LCS, CRS, or NSA| - Susan Aglukark and Chad Irschick
Date: Mon, 13 Apr 1998 12:26:59 -0400 (EDT)
From: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
Subject: Re: [HACKERS] Safe/Fast I/O ...
> [Please forgive me for the way this post is put together; I'm not
> actually on your mailing-list, but was just perusing the archives.]
> if your operating system is at all reasonable, using
> memory mapping allows you to take advantage of all the work that has
> gone into tuning your VM system. If you map a large file, and then
> access in some way that shows reasonable locality, the VM system will
> probably be able to do a better job of page replacement on a
> system-wide basis than you could do with a cache built into your
> application.
not necessarily. in this case, the application (the database) has
several very different page access patterns, some of which (e.g.,
non-boustrophedonic nested-loops join, index leaf accesses) *do not*
exhibit reasonable locality and therefore benefit from the ability to
turn on hate-hints or MRU paging on a selective basis. database query
processing is one of the classic examples why "one size does not fit
all" when it comes to page replacement -- no amount of "tuning" of an
LRU/clock algorithm will help if the access pattern is wrong.
stonebraker's 20-year-old CACM flame on operating system services for
databases has motivated a lot of work, e.g., microkernel external
pagers and the more recent work at stanford and princeton on
application-specific paging, but many older VM systems still don't
have a working madvise()... meaning that a *portable* database still
has to implement its own buffer cache if it wants to exploit its
application-specific paging behavior.
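Where madvise() does work, the hint in question is a one-liner (a sketch;
MADV_RANDOM plays the role of a "hate hint" that disables read-ahead on a
mapping that will be touched randomly, and advise_access_pattern() is an
invented name):

#include <sys/mman.h>

/*
 * Declare the expected access pattern for a mapped relation: an
 * index scan's heap accesses are effectively random, a sequential
 * scan is not.  Does nothing useful on systems without a working
 * madvise().
 */
static void
advise_access_pattern(void *base, size_t len, int sequential)
{
    madvise(base, len, sequential ? MADV_SEQUENTIAL : MADV_RANDOM);
}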
--
Paul M. Aoki | University of California at Berkeley
aoki@CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776
| Berkeley, CA 94720-1776
While I had a couple of spare hours, I was looking at the current code
of postgres, trying to estimate how memory mapping would fit into the
current postgres guts.
Finally I've found more proof that memory mapping would do a lot for
current performance, but I must admit that the current storage manager is
pretty read/write oriented. It would be easier to integrate memory
mapping into the buffer manager. Actually, the buffer manager's role is to
map some parts of files into memory buffers. However, it takes a lot to
get through several layers (smgr and finally md).
I noticed that one of the very important features of mmaping is that you
can sync the buffer (even some part of it), not the whole file. So if
there were some kind of page-level locking, it would be absolutely
necessary to make sure that only committed pages are synced and we don't
overload the IO with unfinished things.
Also, I think that there is no need to create buffers in shared memory.
I have just tested that if you map files with the MAP_SHARED attribute set,
then each process is working on exactly the same copy of memory.
I have also noticed more interesting things; maybe somebody could
clarify this, since I'm not so literate with mmaping. The first thing I
was wondering about was how we would deal with open-descriptor limits if
we use direct buffer-to-file mappings. While currently buffers are
isolated from files, it's possible to close some descriptors without
throwing away buffers. However, it seems (I tried it) that memory mapping
works even after a file descriptor is closed. So, is it possible to cross
the limit of open files by using memory mapping? Or maybe the descriptor
remains open until the munmap call? Or maybe it's just a Linux feature?
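For illustration, the partial sync could look like this (sync_one_page()
is an invented name; it assumes the database block being synced is
aligned to the VM page size):

#include <sys/mman.h>
#include <unistd.h>

/* Force a single page of a mapping to disk, e.g. once the
 * transaction that dirtied it commits; other dirty pages of the
 * same file are left alone. */
static int
sync_one_page(char *mapbase, long pageno)
{
    long pagesize = sysconf(_SC_PAGESIZE);

    return msync(mapbase + pageno * pagesize, pagesize, MS_SYNC);
}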
Mike
--
WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND
Michal Mosiewicz wrote:
> While I had a couple of spare hours, I was looking at the current code
> of postgres, trying to estimate how memory mapping would fit into the
> current postgres guts.
> Finally I've found more proof that memory mapping would do a lot for
> current performance, but I must admit that the current storage manager is
> pretty read/write oriented. It would be easier to integrate memory
> mapping into the buffer manager. Actually, the buffer manager's role is to
> map some parts of files into memory buffers. However, it takes a lot to
> get through several layers (smgr and finally md).
> I noticed that one of the very important features of mmaping is that you
> can sync the buffer (even some part of it), not the whole file. So if
> there were some kind of page-level locking, it would be absolutely
> necessary to make sure that only committed pages are synced and we don't
> overload the IO with unfinished things.
> Also, I think that there is no need to create buffers in shared memory.
> I have just tested that if you map files with the MAP_SHARED attribute set,
> then each process is working on exactly the same copy of memory.
This means that the processes can share the memory, but these pages
must be explicitly mapped in the other process before it can get to
them and must be explicitly unmapped from all processes before the
memory is freed up.
It seems like there are basically two ways we could use this.
1) mmap in all files that might be used and just access them directly.
2) mmap in pages from files as they are needed and munmap the pages
out when they are no longer needed.
#1 seems easier, but it does limit us to 2GB databases on 32-bit
machines.
#2 could be done by having a sort of mmap helper. As soon as process
A knows that it will need (might need?) a given page from a given
file, it communicates this to another process B, which attempts to
create a shared mmap for that page. When process A actually needs to
use the page, it uses the real mmap, which should be fast if process B
has already mapped this page into memory.
Other processes could make use of this mapping (following proper
locking etiquette), each making their request to B, which simply
increments a counter on that mapping for each request after the first
one. When a process is done with one of these mappings, it unmaps the
page itself, and then tells B that it is done with the page. When B
sees that the count on this page has gone to zero, it can either
remove its own map, or retain it in some sort of cache in case it is
requested again in the near future. Either way, when B figures the
page is no longer being used, it unmaps the page itself.
This mapping might get synced by the OS at unknown intervals, but
processes can sync the pages themselves, say at the end of a
transaction.
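The primitive underneath #2 is simple, because mmap() takes a file
offset: individual blocks can be mapped and unmapped one at a time (a
sketch; map_block(), unmap_block(), and BLCKSZ are invented, and the
offset must be VM-page aligned, which holds if BLCKSZ is a multiple of
the page size):

#include <sys/types.h>
#include <sys/mman.h>

#define BLCKSZ 8192

/* Map exactly one block of an open relation file. */
static void *
map_block(int fd, long blkno)
{
    return mmap(NULL, BLCKSZ, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, blkno * (off_t) BLCKSZ);
}

/* Release a single-block mapping created by map_block(). */
static void
unmap_block(void *page)
{
    munmap(page, BLCKSZ);
}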
Ocie
On Wed, 15 Apr 1998, Michal Mosiewicz wrote:
> isolated from files, it's possible to close some descriptors without
> throwing away buffers. However, it seems (I tried it) that memory mapping
> works even after a file descriptor is closed. So, is it possible to cross
> the limit of open files by using memory mapping? Or maybe the descriptor
> remains open until the munmap call? Or maybe it's just a Linux feature?
Nope, that's how it works.
A good friend of mine used this in some modifications to INN (probably in
INN -current right now).
Sending an article involved opening the file, mmapping it, closing the fd,
writing the mapped area and munmap-ing.
It's pretty slick.
Be careful of the file changing under you.
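The pattern, as a sketch (send_article() is invented and error handling
is trimmed): the mapping keeps the file's contents reachable even though
the descriptor is already gone.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Write a file to a socket: map it, close the fd at once,
 * send from the mapping, unmap. */
static void
send_article(const char *path, int sock)
{
    struct stat st;
    void  *p;
    int    fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0)
        return;
    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping survives the close */
    if (p != MAP_FAILED)
    {
        write(sock, p, st.st_size);  /* kernel pages the file in as needed */
        munmap(p, st.st_size);
    }
}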
/*
Matthew N. Dodd | A memory retaining a love you had for life
winter@jurai.net | As cruel as it seems nothing ever seems to
http://www.jurai.net/~winter | go right - FLA M 3.1:53
*/
> While I had a couple of spare hours, I was looking at the current code
> of postgres, trying to estimate how memory mapping would fit into the
> current postgres guts.
> Finally I've found more proof that memory mapping would do a lot for
> current performance, but I must admit that the current storage manager is
> pretty read/write oriented. It would be easier to integrate memory
> mapping into the buffer manager. Actually, the buffer manager's role is to
> map some parts of files into memory buffers. However, it takes a lot to
> get through several layers (smgr and finally md).
> I noticed that one of the very important features of mmaping is that you
> can sync the buffer (even some part of it), not the whole file. So if
> there were some kind of page-level locking, it would be absolutely
> necessary to make sure that only committed pages are synced and we don't
> overload the IO with unfinished things.
We really don't need to worry about it. Our goal is to control flushing
of pg_log to disk. If we control that, we don't care if the non-pg_log
pages go to disk. In a crash, any non-synced pg_log transactions are
rolled back.
We are spoiled because we have just one compact central file to worry
about sync-ing.
> Also, I think that there is no need to create buffers in shared memory.
> I have just tested that if you map files with the MAP_SHARED attribute set,
> then each process is working on exactly the same copy of memory.
> I have also noticed more interesting things; maybe somebody could
> clarify this, since I'm not so literate with mmaping. The first thing I
> was wondering about was how we would deal with open-descriptor limits if
> we use direct buffer-to-file mappings. While currently buffers are
> isolated from files, it's possible to close some descriptors without
> throwing away buffers. However, it seems (I tried it) that memory mapping
> works even after a file descriptor is closed. So, is it possible to cross
> the limit of open files by using memory mapping? Or maybe the descriptor
> remains open until the munmap call? Or maybe it's just a Linux feature?
Not sure about this, but the open file limit is not a restriction for us
very often. It is a per-backend issue, and I can't imagine cases
where a backend has more than 64 file descriptors open. If so, you can
usually increase the kernel limits.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
> As David Gould mentioned, we need to do pre-fetching of data pages
> somehow.
> When doing a sequential scan on a table, the OS is doing a one-page
> prefetch, which is probably enough. The problem is index scans of the
> table. Those are not sequential in the main heap table (unless it is
> clustered on the index), so a prefetch would help here a lot.
> That is where we need async i/o. I am looking in BSDI, and I don't see
> any way to do async i/o. The only way I can think of doing it is via
> threads.
> I found it. It is an fcntl option. From man fcntl:
> O_ASYNC Enable the SIGIO signal to be sent to the process group when
> I/O is possible, e.g., upon availability of data to be read.
> Who else supports this?
under Irix:
man fcntl:
F_SETFL Set file status flags to the third argument, arg, taken as an
object of type int. Only the following flags can be set [see
fcntl(5)]: FAPPEND, FSYNC, FNDELAY, FNONBLK, FDIRECT, and
FASYNC. Since arg is used as a bit vector to set the flags,
values for all the flags must be specified in arg. (Typically,
arg may be constructed by obtaining existing values by F_GETFL
and then changing the particular flags.) FAPPEND is equivalent
to O_APPEND; FSYNC is equivalent to O_SYNC; FNDELAY is
equivalent to O_NDELAY; FNONBLK is equivalent to O_NONBLOCK;
and FDIRECT is equivalent to O_DIRECT. FASYNC is equivalent to
calling ioctl with the FIOASYNC command (except that with ioctl
all flags need not be specified). This enables the SIGIO
facilities and is currently supported only on sockets.
...but then I can find no details of FIOASYNC on the ioctl page or the
pages referenced therein.
Andrew
----------------------------------------------------------------------------
Dr. Andrew C.R. Martin University College London
EMAIL: (Work) martin@biochem.ucl.ac.uk (Home) andrew@stagleys.demon.co.uk
URL: http://www.biochem.ucl.ac.uk/~martin
Tel: (Work) +44(0)171 419 3890 (Home) +44(0)1372 275775
> > As David Gould mentioned, we need to do pre-fetching of data pages
> > somehow.
> > When doing a sequential scan on a table, the OS is doing a one-page
> > prefetch, which is probably enough. The problem is index scans of the
> > table. Those are not sequential in the main heap table (unless it is
> > clustered on the index), so a prefetch would help here a lot.
> > That is where we need async i/o. I am looking in BSDI, and I don't see
> > any way to do async i/o. The only way I can think of doing it is via
> > threads.
> > I found it. It is an fcntl option. From man fcntl:
> > O_ASYNC Enable the SIGIO signal to be sent to the process group when
> > I/O is possible, e.g., upon availability of data to be read.
> > Who else supports this?
> under Irix:
> man fcntl:
> F_SETFL Set file status flags to the third argument, arg, taken as an
> object of type int. Only the following flags can be set [see
> fcntl(5)]: FAPPEND, FSYNC, FNDELAY, FNONBLK, FDIRECT, and
> FASYNC. Since arg is used as a bit vector to set the flags,
> values for all the flags must be specified in arg. (Typically,
> arg may be constructed by obtaining existing values by F_GETFL
> and then changing the particular flags.) FAPPEND is equivalent
> to O_APPEND; FSYNC is equivalent to O_SYNC; FNDELAY is
> equivalent to O_NDELAY; FNONBLK is equivalent to O_NONBLOCK;
> and FDIRECT is equivalent to O_DIRECT. FASYNC is equivalent to
> calling ioctl with the FIOASYNC command (except that with ioctl
> all flags need not be specified). This enables the SIGIO
> facilities and is currently supported only on sockets.
> ...but then I can find no details of FIOASYNC on the ioctl page or the
> pages referenced therein.
I have found that BSDI does not support async i/o. You need a separate
process to do the i/o. The O_ASYNC flag only works on tty files.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)