mmap and MAP_ANON

Started by Bruce Momjian, over 27 years ago, 27 messages
#1Bruce Momjian
maillist@candle.pha.pa.us

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

*BSD has it, but I am not sure of the others. I am researching cache
size issues and the use of mmap vs. SYSV shared memory.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#2Noname
dj@pelf.harvard.edu
In reply to: Bruce Momjian (#1)
Re: [HACKERS] mmap and MAP_ANON

I can't find MAP_ANON on Solaris 2.5.1 or 2.5.6. The man
page claims the following options are available:

MAP_SHARED Share changes.
MAP_PRIVATE Changes are private.
MAP_FIXED Interpret addr exactly.
MAP_NORESERVE Don't reserve swap space.

If you'd like, I can send along the whole man page.

--------- Received message begins Here ---------

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

*BSD has it, but I am not sure of the others. I am researching cache
size issues and the use of mmap vs. SYSV shared memory.


-------------
Diab Jerius Harvard-Smithsonian Center for Astrophysics
60 Garden St, MS 70, Cambridge MA 02138 USA
djerius@cfa.harvard.edu vox: 617 496 7575 fax: 617 495 7356

#3Göran Thyni
goran@bildbasen.se
In reply to: Bruce Momjian (#1)
Re: [HACKERS] mmap and MAP_ANON

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

*BSD has it, but I am not sure of the others. I am researching cache
size issues and the use of mmap vs. SYSV shared memory.

SVR4 (at least older versions) does not support MAP_ANON,
but the approach recommended in W. Richard Stevens'
"Advanced Programming in the UNIX Environment" (aka the Bible part 2)
is to use /dev/zero.

This should be configurable with autoconf:

<PSEUDO CODE>

if (exists MAP_ANON) use it; else use /dev/zero

------------

flags = MAP_SHARED;
#ifdef HAS_MMAP_ANON
fd = -1;
flags |= MAP_ANON;
#else
fd = open("/dev/zero", O_RDWR);
#endif
area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);

</PSEUDO CODE>

regards,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna
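
A minimal C sketch of the selection logic in the pseudo code above, assuming the flag can be tested directly with the preprocessor instead of through a configure-generated symbol. `alloc_shared` is a hypothetical helper name, not Postgres code, and the `MAP_ANONYMOUS` alias covers the HP-UX spelling reported elsewhere in this thread. Whether the resulting mapping is really shared across fork() still depends on the kernel.

```c
#define _DEFAULT_SOURCE /* glibc: keep MAP_ANON visible under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Some systems spell the flag MAP_ANONYMOUS; accept either name. */
#if !defined(MAP_ANON) && defined(MAP_ANONYMOUS)
#define MAP_ANON MAP_ANONYMOUS
#endif

/*
 * alloc_shared (hypothetical helper): return `size` bytes of zero-filled
 * memory mapped with MAP_SHARED, or NULL on failure.
 */
void *alloc_shared(size_t size)
{
    int   flags = MAP_SHARED;
    int   fd = -1;
    void *area;

    assert(size > 0);

#ifdef MAP_ANON
    flags |= MAP_ANON;                  /* no backing file, fd stays -1 */
#else
    fd = open("/dev/zero", O_RDWR);     /* fallback suggested in the thread */
    if (fd == -1)
        return NULL;
#endif

    area = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, fd, 0);

    if (fd != -1)
        close(fd);                      /* the mapping outlives the descriptor */

    return area == MAP_FAILED ? NULL : area;
}
```

A caller would invoke `alloc_shared(size)` once before forking, so that children inherit the mapping, and release it with `munmap()`.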

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Göran Thyni (#3)
Re: [HACKERS] mmap and MAP_ANON

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

On HPUX it seems to be spelled MAP_ANONYMOUS. At least if this means
the same thing as what you are talking about. The HP man page says

: The MAP_FILE and MAP_ANONYMOUS flags control whether the region to be
: mapped is a mapped file region or an anonymous shared memory region.
: Exactly one of these flags must be selected.

regards, tom lane

#5Andrew Martin
martin@biochemistry.ucl.ac.uk
In reply to: Tom Lane (#4)
Re: [HACKERS] mmap and MAP_ANON

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

IRIX doesn't seem to have it (checked both IRIX 5 and IRIX 6).

Andrew

----------------------------------------------------------------------------
Dr. Andrew C.R. Martin University College London
EMAIL: (Work) martin@biochem.ucl.ac.uk (Home) andrew@stagleys.demon.co.uk
URL: http://www.biochem.ucl.ac.uk/~martin
Tel: (Work) +44(0)171 419 3890 (Home) +44(0)1372 275775

#6Noname
ocie@paracel.com
In reply to: Bruce Momjian (#1)
Re: [HACKERS] mmap and MAP_ANON

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

Doesn't seem to appear in Linux (2.0.30 kernel). As another poster
commented, /dev/zero can be mapped for anonymous memory.

Ocie Mitchell

#7Göran Thyni
goran@bildbasen.se
In reply to: Bruce Momjian (#1)
Re: [HACKERS] mmap and MAP_ANON

Göran Thyni wrote:

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

*BSD has it, but I am not sure of the others. I am researching cache
size issues and the use of mmap vs. SYSV shared memory.

SVR4 (at least older versions) does not support MAP_ANON,
but the approach recommended in W. Richard Stevens'
"Advanced Programming in the UNIX Environment" (aka the Bible part 2)
is to use /dev/zero.

This should be configurable with autoconf:

<PSEUDO CODE>

if (exists MAP_ANON) use it; else use /dev/zero

------------

flags = MAP_SHARED;
#ifdef HAS_MMAP_ANON
fd = -1;
flags |= MAP_ANON;
#else
fd = open("/dev/zero", O_RDWR);
#endif
area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);

</PSEUDO CODE>

Ouch, hate to say this but:
I played around with this last night and
I can't get either of the above techniques to work with Linux 2.0.33.

I will try it with the upcoming 2.2,
but for now, we can't lose shmem without losing
a large part of the users (including some developers).

<PSEUDO CODE>
#ifdef HAS_WORKING_MMAP
flags = MAP_SHARED;
#ifdef HAS_MMAP_ANON
fd = -1;
flags |= MAP_ANON;
#else
fd = open("/dev/zero", O_RDWR);
#endif
area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
#else
id = shmget(...);
area = shmat(...);
#endif
</PSEUDO CODE>

not happy,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna
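
The SysV branch of the fallback above could look roughly like the following sketch. The arguments are one plausible choice and are not taken from the thread, where they are elided: a private key, owner-only permissions, and immediate marking for removal so the segment does not outlive its users. `sysv_alloc_shared` is a hypothetical name.

```c
#define _DEFAULT_SOURCE /* glibc: SysV IPC declarations under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * sysv_alloc_shared (hypothetical helper): attach a fresh SysV shared
 * memory segment of `size` bytes, or return NULL on failure.
 */
void *sysv_alloc_shared(size_t size)
{
    int   id;
    void *area;

    assert(size > 0);

    /* IPC_PRIVATE: a new segment that is reachable only through this id */
    id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1)
        return NULL;

    area = shmat(id, NULL, 0);

    /*
     * Mark the segment for removal right away. It is destroyed once the
     * last process detaches, so a crash does not leak it.
     */
    shmctl(id, IPC_RMID, NULL);

    return area == (void *) -1 ? NULL : area;
}
```

Children created by fork() inherit the attachment; each process detaches with `shmdt()`. The segment size is still subject to the kernel's SysV limits discussed later in the thread.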

#8Hannu Krosing
hannu@trust.ee
In reply to: Göran Thyni (#7)
Re: [HACKERS] mmap and MAP_ANON

ocie@paracel.com wrote:

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

Doesn't seem to appear in Linux (2.0.30 kernel). As another poster
commented, /dev/zero can be mapped for anonymous memory.

although 'man mmap' does not say it, it is present in sys/mman.h on
linux (at least 2.0.33)

it is NOT present in Solaris x86 v.2.6

it is NOT present in SINIX v5.42 (UNIX(r) System V Release 4.1)

--------------
Hannu Krosing

#9Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#6)
Re: [HACKERS] mmap and MAP_ANON

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

Doesn't seem to appear in Linux (2.0.30 kernel). As another poster
commented, /dev/zero can be mapped for anonymous memory.

OK, who doesn't have /dev/zero?

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#10Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Göran Thyni (#7)
Re: [HACKERS] mmap and MAP_ANON

Göran Thyni wrote:

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

*BSD has it, but I am not sure of the others. I am researching cache
size issues and the use of mmap vs. SYSV shared memory.

SVR4 (at least older versions) does not support MAP_ANON,
but the approach recommended in W. Richard Stevens'
"Advanced Programming in the UNIX Environment" (aka the Bible part 2)
is to use /dev/zero.

This should be configurable with autoconf:

<PSEUDO CODE>

if (exists MAP_ANON) use it; else use /dev/zero

------------

flags = MAP_SHARED;
#ifdef HAS_MMAP_ANON
fd = -1;
flags |= MAP_ANON;
#else
fd = open("/dev/zero", O_RDWR);
#endif
area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);

</PSEUDO CODE>

Ouch, hate to say this but:
I played around with this last night and
I can't get either of the above techniques to work with Linux 2.0.33.

I will try it with the upcoming 2.2,
but for now, we can't lose shmem without losing
a large part of the users (including some developers).

<PSEUDO CODE>
#ifdef HAS_WORKING_MMAP
flags = MAP_SHARED;
#ifdef HAS_MMAP_ANON
fd = -1;
flags |= MAP_ANON;
#else
fd = open("/dev/zero", O_RDWR);
#endif
area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
#else
id = shmget(...);
area = shmat(...);
#endif
</PSEUDO CODE>

What exactly did not work?

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#11Noname
ocie@paracel.com
In reply to: Hannu Krosing (#8)
Re: [HACKERS] mmap and MAP_ANON

Hannu Krosing wrote:

ocie@paracel.com wrote:

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

Doesn't seem to appear in Linux (2.0.30 kernel). As another poster
commented, /dev/zero can be mapped for anonymous memory.

although 'man mmap' does not say it, it is present in sys/mman.h on
linux (at least 2.0.33)

It appears there, but using it causes mmap to return EINVAL.

Ocie

#12Noname
ocie@paracel.com
In reply to: Bruce Momjian (#9)
Re: [HACKERS] mmap and MAP_ANON

Bruce Momjian wrote:

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

Doesn't seem to appear in Linux (2.0.30 kernel). As another poster
commented, /dev/zero can be mapped for anonymous memory.

OK, who doesn't have /dev/zero?

I have been playing around with mmap on Linux. I have been unable to
mmap /dev/zero or to use MAP_ANON in conjunction with MAP_SHARED.
There is no problem sharing memory when a real file is used.
Solaris-sparc seems to have no trouble sharing memory mapped from
/dev/zero. Very strange.
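
Because the header can define the flag while the kernel still rejects it, a build-time check on the macro is not enough. A run-time probe along the following lines could report what mmap() actually does. This is only a sketch; `try_shared_mapping` is a hypothetical name, and returning the errno value is a design choice made here for illustration.

```c
#define _DEFAULT_SOURCE /* glibc: keep MAP_ANON visible under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#if !defined(MAP_ANON) && defined(MAP_ANONYMOUS)
#define MAP_ANON MAP_ANONYMOUS
#endif

/*
 * try_shared_mapping (hypothetical): attempt a one-page MAP_SHARED mapping,
 * either anonymous (use_dev_zero == 0) or backed by /dev/zero.
 * Returns 0 on success, otherwise the errno value of the failing call.
 */
int try_shared_mapping(int use_dev_zero)
{
    long  pagesize = sysconf(_SC_PAGESIZE);
    int   flags = MAP_SHARED;
    int   fd = -1;
    int   result;
    void *p;

    assert(pagesize > 0);

    if (use_dev_zero) {
        fd = open("/dev/zero", O_RDWR);
        if (fd == -1)
            return errno;
    } else {
#ifdef MAP_ANON
        flags |= MAP_ANON;
#else
        return ENOSYS;          /* the flag is not even defined here */
#endif
    }

    p = mmap(NULL, (size_t) pagesize, PROT_READ | PROT_WRITE, flags, fd, 0);
    result = (p == MAP_FAILED) ? errno : 0;

    if (p != MAP_FAILED)
        munmap(p, (size_t) pagesize);
    if (fd != -1)
        close(fd);

    return result;
}
```

On a kernel that behaves like the Linux 2.0.x systems described in this thread, the anonymous case would be expected to return EINVAL; a successful mmap() still says nothing about whether writes are visible across fork(), which needs a separate check.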

Ocie

#13Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#12)
Re: [HACKERS] mmap and MAP_ANON

Bruce Momjian wrote:

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

Doesn't seem to appear in Linux (2.0.30 kernel). As another poster
commented, /dev/zero can be mapped for anonymous memory.

OK, who doesn't have /dev/zero?

I have been playing around with mmap on Linux. I have been unable to
mmap /dev/zero or to use MAP_ANON in conjunction with MAP_SHARED.
There is no problem sharing memory when a real file is used.
Solaris-sparc seems to have no trouble sharing memory mapped from
/dev/zero. Very strange.

And very bad. We have to have a 100% usable solution, or else some code
of the form "if MAP_ANON works, use it; otherwise use shared memory".

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#14Göran Thyni
goran@bildbasen.se
In reply to: Bruce Momjian (#10)
Re: [HACKERS] mmap and MAP_ANON

Bruce Momjian wrote:

Göran Thyni wrote:

Ouch, hate to say this but:
I played around with this last night and
I can't get either of the above techniques to work with Linux 2.0.33.

I will try it with the upcoming 2.2,
but for now, we can't lose shmem without losing
a large part of the users (including some developers).

<PSEUDO CODE>
#ifdef HAS_WORKING_MMAP
flags = MAP_SHARED;
#ifdef HAS_MMAP_ANON
fd = -1;
flags |= MAP_ANON;
#else
fd = open("/dev/zero", O_RDWR);
#endif
area = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, 0);
#else
id = shmget(...);
area = shmat(...);
#endif
</PSEUDO CODE>

What exactly did not work?

OK, here's the story:

Linux can only MAP_SHARED if the file is a *real* file;
devices or tricks like MAP_ANON only work with MAP_PRIVATE.

2.1.101 does not work either, which means 2.2 will probably not
implement this feature (feature freeze is in effect for 2.2).

*But*,
(I was thinking about this,)
we should IMHO take a step backwards to get a better view
over the whole memory subsystem.
- Why and for what is shared memory used in the first place?
- Could we use mmapping of files at a higher level than
src/backend/storage/ipc/ipc.c to get even better performance
and cleanness?

I will, time permitting, look into cleaning up the shmem-init/exit
routines to work in a "no-exec" environment. I also have a hack to use
mmap-shared/private, which of course is untested, since it does not
work on my linux-boxen.

regards,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna

#15Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Göran Thyni (#14)
Re: [HACKERS] mmap and MAP_ANON

*But*,
(I was thinking about this,)
we should IMHO take a step backwards to get a better view
over the whole memory subsystem.
- Why and for what is shared memory used in the first place?
- Could we use mmapping of files at a higher level than
src/backend/storage/ipc/ipc.c to get even better performance
and cleanness?

Yes, we could use mmap() to map the actual files. I will post
timings on this soon.

The shared memory acts as a cache for us, one that can be locked and
does not have to be read in and out of the address space on each access,
as happens when we use the OS buffer cache.

I will, time permitting, look into cleaning up the shmem-init/exit
routines to work in a "no-exec" environment. I also have a hack to use
mmap-shared/private, which of course is untested, since it does not
work on my linux-boxen.

regards,
--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#15)
Re: [HACKERS] mmap and MAP_ANON

"Göran Thyni" <goran@bildbasen.se> writes:

Linux can only MAP_SHARED if the file is a *real* file;
devices or tricks like MAP_ANON only work with MAP_PRIVATE.

Well, this makes some sense: MAP_SHARED implies that the shared memory
will also be accessible to independently started processes, and
to do that you have to have an openable filename to refer to the
data segment by.

MAP_PRIVATE will *not* work for our purposes: according to my copy
of mmap(2):

: If MAP_PRIVATE is set in flags:
: o Modification to the mapped region by the calling process is
: not visible to other processes which have mapped the same
: region using either MAP_PRIVATE or MAP_SHARED.
: Modifications are not visible to descendant processes that
: have inherited the mapped region across a fork().

so privately mapped segments are useless for interprocess communication,
even after we get rid of exec().
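
The difference the man page describes can be observed with a small experiment. The sketch below assumes a system where anonymous mappings work with both sharing modes, which, per this thread, was not the case on Linux 2.0.x. `child_write_visible` is a hypothetical name.

```c
#define _DEFAULT_SOURCE /* glibc: keep MAP_ANON visible under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if !defined(MAP_ANON) && defined(MAP_ANONYMOUS)
#define MAP_ANON MAP_ANONYMOUS
#endif

/*
 * child_write_visible (hypothetical): map one anonymous page with
 * `share_flag` (MAP_SHARED or MAP_PRIVATE), let a forked child write to it,
 * and report whether the parent sees the write.
 * Returns 1 if visible, 0 if not, -1 on error.
 */
int child_write_visible(int share_flag)
{
    long           pagesize = sysconf(_SC_PAGESIZE);
    void          *area;
    volatile char *p;
    pid_t          pid;
    int            seen;

    assert(share_flag == MAP_SHARED || share_flag == MAP_PRIVATE);

    area = mmap(NULL, (size_t) pagesize, PROT_READ | PROT_WRITE,
                share_flag | MAP_ANON, -1, 0);
    if (area == MAP_FAILED)
        return -1;

    p = area;
    p[0] = 0;

    pid = fork();
    if (pid == -1) {
        munmap(area, (size_t) pagesize);
        return -1;
    }
    if (pid == 0) {
        p[0] = 1;               /* child modifies the inherited mapping */
        _exit(0);
    }

    waitpid(pid, NULL, 0);      /* child has finished writing */
    seen = (p[0] == 1);

    munmap(area, (size_t) pagesize);
    return seen;
}
```

With MAP_PRIVATE the child's write lands in its own copy-on-write page, so the parent keeps seeing 0; with MAP_SHARED, where supported, both processes use the same page.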

mmaping /dev/zero, as has been suggested earlier in this thread,
seems like a really bad idea to me. Would that not imply that
any process anywhere in the system that also decides to mmap /dev/zero
would get its hands on the Postgres shared memory segment? You
can't restrict permissions on /dev/zero to prevent it.

Am I right in thinking that the contents of the shared memory segment
do not need to outlive a particular postmaster run? (If they do, then
we have to mmap a real file anyway.) If so, then MAP_ANON(YMOUS) is
a reasonable solution on systems that support it. On those that
don't support it, we will have to mmap a real file owned by (and only
readable/writable by) the postgres user. Time for another configure
test.
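
For the file-backed fallback, a sketch could look like the following. The path is supplied by the caller and `map_owner_only_file` is a hypothetical name; nothing here is taken from the Postgres sources. The file is created with owner-only permissions so that other users cannot open and map it.

```c
#define _DEFAULT_SOURCE /* glibc: ftruncate() etc. under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * map_owner_only_file (hypothetical): create or open `path`, restrict it to
 * mode 0600, size it to `size` bytes and map it MAP_SHARED.
 * Returns the mapping, or NULL on failure.
 */
void *map_owner_only_file(const char *path, size_t size)
{
    int   fd;
    void *area;

    assert(path != NULL && size > 0);

    fd = open(path, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return NULL;

    /* Enforce 0600 even if the file already existed with looser bits. */
    if (fchmod(fd, S_IRUSR | S_IWUSR) == -1 ||
        ftruncate(fd, (off_t) size) == -1) {
        close(fd);
        return NULL;
    }

    area = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */

    return area == MAP_FAILED ? NULL : area;
}
```

Unlike an anonymous mapping, this one can be attached by independently started processes that run as the same user and know the path.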

BTW, /dev/zero doesn't exist anyway on HPUX 9.

regards, tom lane

#17Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#16)
Re: [HACKERS] mmap and MAP_ANON

"Göran Thyni" <goran@bildbasen.se> writes:

Linux can only MAP_SHARED if the file is a *real* file;
devices or tricks like MAP_ANON only work with MAP_PRIVATE.

Well, this makes some sense: MAP_SHARED implies that the shared memory
will also be accessible to independently started processes, and
to do that you have to have an openable filename to refer to the
data segment by.

MAP_PRIVATE will *not* work for our purposes: according to my copy
of mmap(2):

Right.

so privately mapped segments are useless for interprocess communication,
even after we get rid of exec().

Yep.

mmaping /dev/zero, as has been suggested earlier in this thread,
seems like a really bad idea to me. Would that not imply that
any process anywhere in the system that also decides to mmap /dev/zero
would get its hands on the Postgres shared memory segment? You
can't restrict permissions on /dev/zero to prevent it.

Good point.

Am I right in thinking that the contents of the shared memory segment
do not need to outlive a particular postmaster run? (If they do, then
we have to mmap a real file anyway.) If so, then MAP_ANON(YMOUS) is
a reasonable solution on systems that support it. On those that
don't support it, we will have to mmap a real file owned by (and only
readable/writable by) the postgres user. Time for another configure
test.

MAP_ANON is the best, because it can be restricted to only postmaster
children.

The problem with using a real file is that the filesystem is going to be
flushing those dirty pages to disk, and that could really hurt
performance.

Actually, when I install Informix, I always have to modify the kernel to
allow a larger amount of SYSV shared memory. Maybe we just need to give
people per-OS instructions on how to do that. Under BSD/OS, I now have
32MB of shared memory, or 3900 8k shared buffers.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#18Noname
ocie@paracel.com
In reply to: Tom Lane (#16)
Re: [HACKERS] mmap and MAP_ANON

Tom Lane wrote:

"Göran Thyni" <goran@bildbasen.se> writes:

Linux can only MAP_SHARED if the file is a *real* file;
devices or tricks like MAP_ANON only work with MAP_PRIVATE.

Well, this makes some sense: MAP_SHARED implies that the shared memory
will also be accessible to independently started processes, and
to do that you have to have an openable filename to refer to the
data segment by.

MAP_PRIVATE will *not* work for our purposes: according to my copy
of mmap(2):

: If MAP_PRIVATE is set in flags:
: o Modification to the mapped region by the calling process is
: not visible to other processes which have mapped the same
: region using either MAP_PRIVATE or MAP_SHARED.
: Modifications are not visible to descendant processes that
: have inherited the mapped region across a fork().

so privately mapped segments are useless for interprocess communication,
even after we get rid of exec().

mmaping /dev/zero, as has been suggested earlier in this thread,
seems like a really bad idea to me. Would that not imply that
any process anywhere in the system that also decides to mmap /dev/zero
would get its hands on the Postgres shared memory segment? You
can't restrict permissions on /dev/zero to prevent it.

On some systems, mmaping /dev/zero can be shared with child processes
as in this example:

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int fd;
    char *ma;
    pid_t pid;
    long pagesize = sysconf(_SC_PAGESIZE);

    fd = open("/dev/zero", O_RDWR);
    if (fd == -1) {
        perror("open");
        exit(1);
    }

    /* map one page of /dev/zero, shared between parent and child */
    ma = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ma == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    memset(ma, 0, pagesize);

    pid = fork();
    if (pid == -1) {
        perror("fork");
        exit(1);
    }

    if (pid == 0) { /* child */
        ma[0] = 1;
        sleep(1);
        printf("child %d %d\n", ma[0], ma[1]);
        sleep(1);
        return 0;
    } else { /* parent */
        ma[1] = 1;
        sleep(1);
        printf("parent %d %d\n", ma[0], ma[1]);
    }

    wait(NULL);
    munmap(ma, pagesize); /* unmap exactly the one page that was mapped */

    return 0;
}

This works on Solaris and, as expected, both the parent and child are
able to write into the memory and their changes are honored (the
memory is truly shared between processes). We can certainly map a
real file, and this might even give us some interesting crash recovery
options. The nice thing about doing away with the exec is that the
memory mapped in the parent process is available at the same address
region in every process, so we don't have to do funky pointer tricks.

The only problem I see with mmap is that we don't know exactly when a
page will be written to disk, i.e. if you make two writes, the page
might get synced between them, thus storing an inconsistent
intermediate state to the disk. Perhaps with proper transaction
control, this is not a problem.
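
POSIX msync() covers one side of this concern: it can force a flush at a known point, but it cannot prevent the kernel from writing the page earlier. The sketch below illustrates that under stated assumptions; the file path comes from the caller and `write_pair_and_sync` is a hypothetical name.

```c
#define _DEFAULT_SOURCE /* glibc: ftruncate() etc. under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write_pair_and_sync (hypothetical): make two related updates to a
 * file-backed shared page, then force them out with msync().
 * Returns 0 on success, -1 on failure.
 */
int write_pair_and_sync(const char *path)
{
    long  pagesize = sysconf(_SC_PAGESIZE);
    char *page;
    int   fd;
    int   rc;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, (off_t) pagesize) == -1) {
        close(fd);
        return -1;
    }

    page = mmap(NULL, (size_t) pagesize, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
    close(fd);
    if (page == MAP_FAILED)
        return -1;

    page[0] = 'A';  /* first write */
    page[1] = 'B';  /* second write; the kernel may flush between the two */

    /*
     * After msync(MS_SYNC) returns, both writes have been handed to the
     * file. It does NOT stop an earlier flush that saw only the first one.
     */
    rc = msync(page, (size_t) pagesize, MS_SYNC);

    munmap(page, (size_t) pagesize);
    return rc;
}
```

So msync() gives a "written by now" guarantee, not a "not written before" guarantee; the ordering problem raised above would still have to be handled by the transaction machinery.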

The question is whether the individual database files should be mapped
into memory, or whether one "pgmem" file should be mapped, with pages from
different files read into it. The first option would allow different
backend processes to map different pages of different files as they
are needed. The postmaster could "pre-map" pages on behalf of the
backend processes as a sort of intelligent read-ahead mechanism.

I'll try to write this separately from Postgres just to see how it works.

Ocie

#19Michal Mosiewicz
mimo@interdata.com.pl
In reply to: Bruce Momjian (#1)
Re: [HACKERS] mmap and MAP_ANON

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

*BSD has it, but I am not sure of the others. I am researching cache
size issues and the use of mmap vs. SYSV shared memory.

Well, I haven't noticed this discussion. However, I can't understand one
thing:

Why are a lot of people investigating how to replace shared memory with
anonymous mmapping, while there is no discussion of replacing
reads/writes with memory mapping of heap files?

This way we would not only get better system cache
utilisation but also less memory copying. To me it seems
like a more robust solution. I suggested it a few months ago.

If it's a bad idea, I wonder why?
Are there any systems that cannot do mmaps at all?

Mike

--
WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND

#20Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Michal Mosiewicz (#19)
Re: [HACKERS] mmap and MAP_ANON

Bruce Momjian wrote:

Would people tell me what platforms do NOT support the MAP_ANON flag to
the mmap() system call? You should find it in the mmap() manual page.

*BSD has it, but I am not sure of the others. I am researching cache
size issues and the use of mmap vs. SYSV shared memory.

Well, I haven't noticed this discussion. However, I can't understand one
thing:

Why are a lot of people investigating how to replace shared memory with
anonymous mmapping, while there is no discussion of replacing
reads/writes with memory mapping of heap files?

This way we would not only get better system cache
utilisation but also less memory copying. To me it seems
like a more robust solution. I suggested it a few months ago.

If it's a bad idea, I wonder why?
Are there any systems that cannot do mmaps at all?

mmap'ing a file is not necessarily faster. I will post timings soon
that show this.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#21Andreas Zeugswetter
andreas.zeugswetter@telecom.at
In reply to: Bruce Momjian (#20)
AW: [HACKERS] mmap and MAP_ANON

- Could we use mmapping of files at a higher level than
src/backend/storage/ipc/ipc.c to get even better performance
and cleanness?

Yes, we could use mmap() to map the actual files. I will post
timings on this soon.

I do not think this will be a practicable solution, since it would mean the
whole db has to be mmap'ed. This means there has to be enough virtual memory
to hold the complete database, or at least one table at a time. Or am I
misunderstanding this?

Andreas

#22Andreas Zeugswetter
andreas.zeugswetter@telecom.at
In reply to: Andreas Zeugswetter (#21)
Re: [HACKERS] mmap and MAP_ANON

The problem with using a real file is that the filesystem is going to be
flushing those dirty pages to disk, and that could really hurt
performance.

definitely

Actually, when I install Informix, I always have to modify the kernel to
allow a larger amount of SYSV shared memory. Maybe we just need to give
people per-OS instructions on how to do that. Under BSD/OS, I now have
32MB of shared memory, or 3900 8k shared buffers.

This I think would be the best solution. There are actually not that many systems
with too low limits.
AIX: 256 MB per segment, max 10 segments per process (AIX 4.3: any number of segments)

Andreas

#23Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Andreas Zeugswetter (#21)
Re: AW: [HACKERS] mmap and MAP_ANON

- Could we use mmapping of files at a higher level than
src/backend/storage/ipc/ipc.c to get even better performance
and cleanness?

Yes, we could use mmap() to map the actual files. I will post
timings on this soon.

I do not think this will be a practicable solution, since it would mean the
whole db has to be mmap'ed. This means there has to be enough virtual memory
to hold the complete database, or at least one table at a time. Or am I
misunderstanding this?

We can map parts of the table, even in 8k chunks. However, looking at
my sequential scan timing tests, it would be slower.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#24Göran Thyni
goran@bildbasen.se
In reply to: Andreas Zeugswetter (#21)
Re: AW: [HACKERS] mmap and MAP_ANON

Andreas Zeugswetter wrote:

- Could we use mmapping of files at a higher level than
src/backend/storage/ipc/ipc.c to get even better performance
and cleanness?

Yes, we could use mmap() to map the actual files. I will post
timings on this soon.

I do not think this will be a practicable solution, since it would mean the
whole db has to be mmap'ed. This means there has to be enough virtual memory
to hold the complete database, or at least one table at a time. Or am I
misunderstanding this?

Why would we map the whole database or even a whole table?
You can map the section of a file you are interested in.

# man mmap

Besides, a sensible memory manager does not actually map the pages
until they are accessed; unfortunately, not all OSes are sensible.
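
As an illustration of mapping only a section of a file, the sketch below maps a single 8k block by number. `BLOCK_SIZE`, `map_block` and `make_demo_file` are hypothetical names chosen for this example, and it assumes the system page size divides 8192, since mmap() offsets must be page-aligned.

```c
#define _DEFAULT_SOURCE /* glibc: POSIX declarations under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192     /* 8k blocks as discussed in the thread */

/* Map block number `blockno` of an open file read-only; NULL on failure. */
const char *map_block(int fd, unsigned long blockno)
{
    long  pagesize = sysconf(_SC_PAGESIZE);
    void *p;

    /* mmap offsets must be multiples of the page size */
    if (pagesize <= 0 || BLOCK_SIZE % pagesize != 0)
        return NULL;

    p = mmap(NULL, BLOCK_SIZE, PROT_READ, MAP_SHARED, fd,
             (off_t) blockno * BLOCK_SIZE);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Demo helper: create `path` with `nblocks` blocks, block i filled with the
 * byte 'a' + i. Returns an open descriptor, or -1 on failure.
 */
int make_demo_file(const char *path, unsigned nblocks)
{
    char     buf[BLOCK_SIZE];
    unsigned i;
    int      fd;

    assert(path != NULL && nblocks <= 26);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    for (i = 0; i < nblocks; i++) {
        memset(buf, 'a' + (int) i, sizeof buf);
        if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf) {
            close(fd);
            return -1;
        }
    }
    return fd;
}
```

Only the requested 8k of address space is consumed per call, whatever the size of the file; each mapped block is released with `munmap(ptr, BLOCK_SIZE)`.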

--
---------------------------------------------
Göran Thyni, sysadm, JMS Bildbasen, Kiruna

#25Noname
dg@illustra.com
In reply to: Michal Mosiewicz (#19)
Re: [HACKERS] mmap and MAP_ANON

Michal Mosiewicz asks:

Why are a lot of people investigating how to replace shared memory with
anonymous mmapping, while there is no discussion of replacing
reads/writes with memory mapping of heap files?

This way we would not only get better system cache
utilisation but also less memory copying. To me it seems
like a more robust solution. I suggested it a few months ago.

If it's a bad idea, I wonder why?

Unfortunately, it is probably a bad idea.

The postgres buffer cache is a shared pool of pages containing an assortment
of blocks from all the different tables in use by all the different backends.

That is, if backend 'a' is reading table 'ta', and backend 'b' is reading
table 'tb' then the buffer cache will have blocks from both table 'ta'
and table 'tb' in it.

The benefit occurs when backend 'x' starts reading either table 'ta' or 'tb'.
Rather than have to go to disk, it finds the pages already loaded in the
shared buffer cache. Likewise, if backend 'a' should modify a page in table
'ta', the change is then visible to all the other backends (ignoring locks
for this discussion) without any explicit communication between the backends.

If we started creating a separate mmapped region for each table several
problems occur:

- each time a backend wants to use a table it will have to somehow find out
if it is already mapped, and then either map it (for the first time), or
attach to an existing mapping created by another backend. This implies
that the backends need to communicate with all the other backends to let
them know what mappings they are using.

- if two backends are using the same table, and the table is too big to
map the whole thing, then each backend needs a "window" into the table.
This becomes difficult if the two backends are using different parts of
the table (ie, the first page and the last page).

- there is a finite amount of memory available on the system for postgres
to use. This will have to be split among all the open tables used by
all the backends. If you have 50 backends each using 10 tables, each with 3
indexes, you now need 2,000 mappings in the system. Assuming that there
are 2001 pages available for mapping, how do you decide which table gets
to map 2 pages? How do you get all the backends to agree about this?

Essentially, mapping tables separately creates a requirement for a huge
amount of communication and synchronization among the backends. And, even
if this were not prohibitive, it ends up fragmenting the available memory
for buffers so badly that the caching becomes ineffective.

So, unless you are going to map whole tables and those tables are needed by
_all_ the active backends the idea of mmapping separate tables is unworkable.

That said, there are tables that meet these criteria, for instance the
transaction logs and anchors. Here mmapping might indeed be useful but even
so it would take some thought and a fair amount of work to gain any benefit.

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
and someone who knows less will correct me if I'm right."
--David Palmer (palmer@tybalt.caltech.edu)

#26Michal Mosiewicz
mimo@interdata.com.pl
In reply to: Noname (#25)
Re: [HACKERS] mmap and MAP_ANON

David Gould wrote:

- each time a backend wants to use a table it will have to somehow find out
if it is already mapped, and then either map it (for the first time), or
attach to an existing mapping created by another backend. This implies
that the backends need to communicate with all the other backends to let
them know what mappings they are using.

Why does a backend have to check whether it's already mapped? Let's say
that backend A maps the first page from file X using MAP_SHARED, then
backend B maps the first page using MAP_SHARED. At this moment they are
pointing to the same memory area without any communication. (At least
that's the way it works on Linux; in Linux even MAP_PRIVATE is the same
memory region when you mmap it twice, until you write a byte in there, at
which point it's copied.) So why would we check what other backends map?
We use MAP_SHARED so that we do not have to check.
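
The claim that two independent MAP_SHARED mappings of the same file page refer to the same memory can be checked with a short experiment. In the sketch below the two mappings live in one process for simplicity and stand in for two backends; `two_mappings_share` is a hypothetical name and the scratch file path is supplied by the caller.

```c
#define _DEFAULT_SOURCE /* glibc: ftruncate() etc. under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * two_mappings_share (hypothetical): map the first page of `path` twice with
 * MAP_SHARED, write through one mapping and read through the other.
 * Returns 1 if the write is visible, 0 if not, -1 on error.
 */
int two_mappings_share(const char *path)
{
    long   pagesize = sysconf(_SC_PAGESIZE);
    size_t len;
    char  *a;
    char  *b;
    int    fd;
    int    shared;

    assert(path != NULL && pagesize > 0);
    len = (size_t) pagesize;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, (off_t) pagesize) == -1) {
        close(fd);
        return -1;
    }

    /* Two separate mmap() calls, with no coordination between them. */
    a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED)
            munmap(a, len);
        if (b != MAP_FAILED)
            munmap(b, len);
        return -1;
    }

    a[0] = 'x';                           /* write through the first mapping */
    shared = (a != b) && (b[0] == 'x');   /* read through the second one */

    munmap(a, len);
    munmap(b, len);
    return shared;
}
```

The two mappings sit at different virtual addresses but are backed by the same page, so no explicit message passing is needed for the data itself; locking and eviction policy are a separate question.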

- if two backends are using the same table, and the table is too big to
map the whole thing, then each backend needs a "window" into the table.
This becomes difficult if the two backends are using different parts of
the table (ie, the first page and the last page).

Well, I wasn't even thinking of mapping anything more than just the one page
that is needed.

- there is a finite amount of memory available on the system for postgres
to use. This will have to be split among all the open tables used by
all the backends. If you have 50 backends each using 10 tables each with 3
indexes, you now need 2,000 mappings in the system. Assuming that there
are 2001 pages available for mapping, how do you decide which table gets
to map 2 pages? How do you get all the backends to agree about this?

IMHO, this is also not as much of a problem as it looks. When the
system is running out of virtual memory, the occupied pages are
paged out. The system does what the buffer manager actually does - it writes
out the pages that are dirty, and simply frees the memory of those that
are not modified, on a least recently used basis. So the only thing that
costs anything is the memory structures that describe the bindings between
disk blocks and memory. And of course an LRU algorithm is sometimes a bad
choice. Sometimes the backend knows better which pages are best to
page out.

I have to admit that this point seems to be a potential source of
performance drops, and all the backends would have to communicate to
prevent it. But I don't think that this communication is huge. Note that
currently all backends use quite a large communication channel (256 pages
large by default?) which is hardly used for communication purposes but
rather for storage.

Mike

--
WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND

#27Noname
dg@illustra.com
In reply to: Michal Mosiewicz (#26)
Re: [HACKERS] mmap and MAP_ANON

This is all old news, but I am trying to catch up on my hackers mail. This
particular post caught my eye as one to think carefully about before replying.

Michal Mosiewicz <mimo@interdata.com.pl> writes:

David Gould wrote:

- each time a backend wants to use a table it will have to somehow find out
if it is already mapped, and then either map it (for the first time), or
attach to an existing mapping created by another backend. This implies
that the backends need to communicate with all the other backends to let
them know what mappings they are using.

Why does a backend have to check whether it's already mapped? Let's say that
backend A maps the first page of file X using MAP_SHARED, then backend B maps
the first page using MAP_SHARED. At this moment they are pointing to the
same memory area without any communication. (At least that's the way it
works on Linux; in Linux even MAP_PRIVATE gives the same memory region when
you mmap it twice, until you write a byte in there - then it's copied.)
So why would we check what other backends map? We use MAP_SHARED precisely
so that we don't have to check.

- if two backends are using the same table, and the table is too big to
map the whole thing, then each backend needs a "window" into the table.
This becomes difficult if the two backends are using different parts of
the table (ie, the first page and the last page).

Well, I wasn't even thinking of mapping anything more than just the one page
that is needed.

Your statement about not checking if a file was mapped struck me as a problem,
but on second thought, I was thinking about a typical dbms buffer cache. You
are proposing eliminating the dbms buffer cache and using mmap() to read
file pages directly, relying on the OS cache. I agree that this could work.

And at least some OSes have pretty good buffer management and quick
mmap() calls. Linux 2.1.101 seems to be able to do a mmap() in 25 usec on
a P166 according to lmbench. BSD and Solaris are quite a bit slower, and
at the really slow end, IRIX and HPUX take hundreds of usec for mmap().

But even given good OS mmap() and buffer management, there may still be
a performance justification for a separate DBMS buffer cache.

Suppose many backends are sharing a small table eg a lookup table with a
few dozen rows, perhaps three pages worth. Suppose that most queries
scan this table several times (eg multiple joins and subqueries). And
suppose most backends run several queries before being restarted.

This gives the situation where all the backends refer to the same two or three
pages hundreds or thousands of times each.

In the traditional dbms buffer cache, the first backend to scan the table
does, say, three read()s, and each backend does one mmap() at startup time
to map the buffer cache. This means that a very few system calls suffice
for thousands of accesses to the shared table.

Your proposal, if I have understood it, has one page mmap()ed for the table
by each backend. To get the next page another mmap() has to be done. This
results in three mmap()s per scan for each backend. So, even though the
table is fully cached by the OS, thousands of system calls are needed to
service all the scans. Even on systems with very fast mmap() I think this
may be a significant overhead.
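
The call counts in this argument can be sketched roughly as follows. This is
an illustration under assumptions, not postgres code: the function names are
invented, a scratch file stands in for the three-page lookup table, and a
single whole-file mapping stands in for the "map the buffer cache once at
startup" case. Each function returns the number of mmap() calls it made:

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an unlinked scratch file of npages pages; page i starts with byte i+1. */
int make_scratch_table(size_t npages, size_t pagesz)
{
    char path[] = "/tmp/mmap_scan_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file lives until fd is closed */
    if (ftruncate(fd, (off_t) (npages * pagesz)) != 0) {
        close(fd);
        return -1;
    }
    for (size_t i = 0; i < npages; i++) {
        unsigned char tag = (unsigned char) (i + 1);
        if (pwrite(fd, &tag, 1, (off_t) (i * pagesz)) != 1) {
            close(fd);
            return -1;
        }
    }
    return fd;
}

/* One-page window: every page visit costs an mmap() and a munmap(). */
long scan_with_page_window(int fd, size_t npages, size_t pagesz,
                           int nscans, unsigned long *sum)
{
    assert(sum != NULL);
    long calls = 0;
    for (int s = 0; s < nscans; s++) {
        for (size_t i = 0; i < npages; i++) {
            unsigned char *p = mmap(NULL, pagesz, PROT_READ, MAP_SHARED,
                                    fd, (off_t) (i * pagesz));
            if (p == MAP_FAILED)
                return -1;
            calls++;
            *sum += p[0];               /* touch the page, as a scan would */
            munmap(p, pagesz);
        }
    }
    return calls;
}

/* Map once up front; every later page visit is a plain memory access. */
long scan_with_single_mapping(int fd, size_t npages, size_t pagesz,
                              int nscans, unsigned long *sum)
{
    assert(sum != NULL);
    unsigned char *base = mmap(NULL, npages * pagesz, PROT_READ,
                               MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return -1;
    for (int s = 0; s < nscans; s++)
        for (size_t i = 0; i < npages; i++)
            *sum += base[i * pagesz];
    munmap(base, npages * pagesz);
    return 1;
}
```

With three pages and 1000 scans, the windowed version makes 3000 mmap()
calls (plus as many munmap() calls) while the single mapping makes one.
Timing the two loops on a given OS would give a rough feel for how much the
per-call cost matters there.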

That is, there may be a reason all the high-end dbms's use their own buffer
caches.

If you are interested, this could be tested with not too much work. Simply
instrument the buffer manager to trace buffer lookups, and read()s, and
write()s and log this to a file. Then write a simple program to run the
trace file performing the same operations only using mmap(). Try to get
a trace from a busy web site or other heavy duty application using postgres.
I think that this will show that the buffer cache has its place in life.
But, I am prepared to hear otherwise.

- there is a finite amount of memory available on the system for postgres
to use. This will have to be split among all the open tables used by
all the backends. If you have 50 backends each using 10 tables each with 3
indexes, you now need 2,000 mappings in the system. Assuming that there
are 2001 pages available for mapping, how do you decide which table gets
to map 2 pages? How do you get all the backends to agree about this?

IMHO, this is also not as much of a problem as it looks. When the
system is running out of virtual memory, the occupied pages are
paged out. The system does what the buffer manager actually does - it writes
out the pages that are dirty, and simply frees the memory of those that
are not modified, on a least recently used basis. So the only thing that
costs anything is the memory structures that describe the bindings between
disk blocks and memory. And of course an LRU algorithm is sometimes a bad
choice. Sometimes the backend knows better which pages are best to
page out.

I have to admit that this point seems to be a potential source of
performance drops, and all the backends would have to communicate to
prevent it. But I don't think that this communication is huge. Note that
currently all backends use quite a large communication channel (256 pages
large by default?) which is hardly used for communication purposes but
rather for storage.

Perhaps. Still, to implement this would be a major task. I would prefer to
spend that effort on adding page or row level locking for instance.

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software 300 Lakeside Drive Oakland, CA 94612
- A child of five could understand this! Fetch me a child of five.