POSIX question

Started by Radosław Smogura over 14 years ago, 13 messages
#1 Radosław Smogura
rsmogura@softperience.eu

Hello,

I had an idea involving hugepages, and I read why PostgreSQL doesn't
support POSIX shared memory (it needs nattach). While reading about
POSIX/SysV I found this thread about dynamically chunked shared memory:

http://archives.postgresql.org/pgsql-hackers/2010-08/msg00586.php

While playing with mmap I worked out an approach for dealing with growing
files, so...

Maybe this approach could solve both of the above problems (POSIX and
dynamic shared memory). Here is the idea:

1. mmap a large amount of anonymous virtual memory (this will be the
maximum size of shared memory).
2. Initialize a small SysV chunk for the shmem header (to keep the
"fallout" requirements).
3. SysV remap is Linux-specific, so unmap the first few VM pages from
step 1 and attach the chunk from step 2 there.
3a. Lock the header when adding chunks (the 1st chunk is the header); we
don't want concurrent chunk allocation.
4. Allocate further chunks of shared memory (POSIX is the best way) and
record them in the shmem header: chunk id/name, whether the chunk is
POSIX or SysV, and any useful flags (hugepage?) needed for reattaching.
Attach those chunks inside the range from step 1.
4b. Release the lock taken in 3a.
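
The bookkeeping described in steps 2 to 4 could look roughly like the sketch below. It is only an illustration: every name in it (ShmemHeader, ChunkDesc, header_add_chunk, the limits) is hypothetical and not taken from PostgreSQL, and the locking from steps 3a/4b is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/* All names and limits below are hypothetical, chosen for illustration. */
#define SHMEM_MAX_CHUNKS 64   /* fixed limit so the header stays one contiguous block */
#define SHMEM_NAME_LEN   64

typedef enum { CHUNK_SYSV, CHUNK_POSIX } ChunkKind;

typedef struct {
    ChunkKind kind;                 /* is this chunk POSIX or SysV? */
    char      name[SHMEM_NAME_LEN]; /* chunk id/name used when reattaching */
    size_t    offset;               /* position inside the reserved range */
    size_t    size;
    unsigned  flags;                /* e.g. a "hugepage" bit */
} ChunkDesc;

typedef struct {
    size_t    reserved_size;        /* size of the range reserved in step 1 */
    size_t    used;                 /* bytes already handed out, header included */
    int       nchunks;
    ChunkDesc chunks[SHMEM_MAX_CHUNKS];
} ShmemHeader;

/*
 * Record a new chunk in the header and return its offset inside the
 * reserved range, or (size_t) -1 if the table or the range is full.
 * The caller is assumed to hold the header lock (step 3a).
 */
size_t header_add_chunk(ShmemHeader *h, ChunkKind kind,
                        const char *name, size_t size, unsigned flags)
{
    ChunkDesc *c;

    if (h->nchunks >= SHMEM_MAX_CHUNKS)
        return (size_t) -1;
    if (size > h->reserved_size - h->used)
        return (size_t) -1;

    c = &h->chunks[h->nchunks];
    c->kind = kind;
    strncpy(c->name, name, SHMEM_NAME_LEN - 1);
    c->name[SHMEM_NAME_LEN - 1] = '\0';
    c->offset = h->used;
    c->size = size;
    c->flags = flags;

    h->used += size;
    h->nchunks++;
    return c->offset;
}
```

A process that attaches later would walk chunks[] and map each entry at its recorded offset, which is why the table has a fixed maximum size.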

Point 1 will not eat memory, as memory allocation is delayed, and on
64-bit platforms you may reserve quite a huge chunk this way. In the future
it may be possible to use mmap / munmap to concatenate chunks, defragment
them, etc.

mmap guarantees that mapping with MAP_FIXED over an already mapped area
will unmap the old mapping.
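
A minimal sketch of that MAP_FIXED behaviour, under some assumptions: the function names are made up, the reservation uses PROT_NONE (a choice of this sketch, not something stated above), and an anonymous MAP_SHARED mapping stands in for a real POSIX or SysV chunk.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Reserve `size` bytes of address space. Nothing is accessible yet. */
void *reserve_range(size_t size)
{
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Map a shared, writable chunk over part of the reservation.
 * MAP_FIXED replaces whatever was mapped at that address before.
 * `offset` and `len` must be multiples of the page size.
 */
void *overlay_chunk(void *base, size_t offset, size_t len)
{
    void *p = mmap((char *) base + offset, len,
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

On success overlay_chunk() returns exactly base + offset, so addresses computed from the reserved base stay valid after a chunk is added.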

A working "preview" changeset for sysv_memory.c may be quite small.

If someone wants to "extend" memory, he may add a new chunk (of course,
to keep the header memory contiguous, the number of chunks is limited).

What do you think about this?

Regards,
Radek

#2 Florian Pflug
fgp@phlo.org
In reply to: Radosław Smogura (#1)
Re: POSIX question

On Jun20, 2011, at 15:27 , Radosław Smogura wrote:

1. mmap some large amount of anonymous virtual memory (this will be maximum size of shared memory).
...
Point 1 will not eat memory, as memory allocation is delayed, and on 64-bit platforms you may reserve quite a huge chunk this way. In the future it may be possible to use mmap / munmap to concatenate chunks, defragment them, etc.

I think this breaks with strict overcommit settings (i.e. vm.overcommit_memory = 2 on Linux). To fix that, you'd need a way to tell the kernel (or glibc) to simply reserve a chunk of virtual address space for further use. Not sure if there's an API for that...

best regards,
Florian Pflug

#3 Radosław Smogura
rsmogura@softperience.eu
In reply to: Florian Pflug (#2)
Re: POSIX question

Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 16:16:58

I think this breaks with strict overcommit settings (i.e.
vm.overcommit_memory = 2 on Linux). To fix that, you'd need a way to tell
the kernel (or glibc) to simply reserve a chunk of virtual address space
for further use. Not sure if there's an API for that...

best regards,
Florian Pflug

This may be achieved in many other ways, like mmap /dev/null.

Regards,
Radek

#4 Florian Weimer
fweimer@bfk.de
In reply to: Florian Pflug (#2)
Re: POSIX question

* Florian Pflug:

I think this breaks with strict overcommit settings
(i.e. vm.overcommit_memory = 2 on Linux). To fix that, you'd need a
way to tell the kernel (or glibc) to simply reserve a chunk of virtual
address space for further use. Not sure if there's an API for that...

mmap with PROT_NONE and subsequent update with mprotect does this on
Linux.

(It's not clear to me what this is trying to solve, though.)
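
The PROT_NONE plus mprotect technique mentioned above can be sketched as follows. This is an illustration only: the function names are hypothetical, and how such a mapping is counted under vm.overcommit_memory = 2 should be checked against the kernel's overcommit accounting documentation rather than taken from this sketch.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Reserve address space only: the pages can be neither read nor written. */
void *reserve_address_space(size_t size)
{
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Make part of the reservation usable. Returns 0 on success, -1 on error
 * (for example when the kernel refuses to commit the memory).
 * `offset` and `len` must be multiples of the page size.
 */
int commit_range(void *base, size_t offset, size_t len)
{
    return mprotect((char *) base + offset, len, PROT_READ | PROT_WRITE);
}
```

The point of the split is that a failure shows up in commit_range(), where the caller can handle it, instead of in the initial large reservation.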

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

#5 Florian Pflug
fgp@phlo.org
In reply to: Radosław Smogura (#3)
Re: POSIX question

On Jun20, 2011, at 16:39 , Radosław Smogura wrote:

This may be achieved in many other ways, like mmap /dev/null.

Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?

Or at least this is what I always thought glibc does when you malloc()
a large block at once. (This allows it to actually return the memory
to the kernel once you free() it, which isn't possible if the memory
was allocated simply by extending the heap).

You can work around this by mmap()ing an actual file, because then
the kernel knows it can use the file as backing store and thus doesn't
need to reserve actual physical memory. (In a way, this just adds
additional swap space). Doesn't seem very clean though...

Even if there's a way to work around a strict overcommit setting, unless
the workaround is a syscall *explicitly* designed for that purpose, I'd
be very careful with using it. You might just as well be exploiting a
bug in the overcommit accounting logic and future kernel versions may
simply choose to fix the bug...

best regards,
Florian Pflug

#6 Radosław Smogura
rsmogura@softperience.eu
In reply to: Florian Pflug (#5)
Re: POSIX question

Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 17:01:40

Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?

Or at least this is what I always thought glibc does when you malloc()
a large block at once. (This allows it to actually return the memory
to the kernel once you free() it, which isn't possible if the memory
was allocated simply by extending the heap).

You can work around this by mmap()ing an actual file, because then
the kernel knows it can use the file as backing store and thus doesn't
need to reserve actual physical memory. (In a way, this just adds
additional swap space). Doesn't seem very clean though...

Even if there's a way to work around a strict overcommit setting, unless
the workaround is a syscall *explicitly* designed for that purpose, I'd
be very careful with using it. You might just as well be exploiting a
bug in the overcommit accounting logic and future kernel versions may
simply choose to fix the bug...

best regards,
Florian Pflug

I'm 99% sure. When I was "playing" with mmap I preallocated, probably,
about 100GB of memory.

Regards,
Radek

#7 Florian Pflug
fgp@phlo.org
In reply to: Radosław Smogura (#6)
Re: POSIX question

On Jun20, 2011, at 17:05 , Radosław Smogura wrote:

I'm 99% sure. When I was "playing" with mmap I preallocated, probably,
about 100GB of memory.

You need to set vm.overcommit_memory to "2" to see the difference. Did
you do that?

You can do that either with "echo 2 > /proc/sys/vm/overcommit_memory"
or by editing /etc/sysctl.conf and issuing "sysctl -p".
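
A test program can also check which mode is active before drawing conclusions from a large mmap. The helper below is a small sketch with a made-up name; it only reads the same /proc file mentioned above.

```c
#include <stdio.h>

/*
 * Return the current vm.overcommit_memory mode (0, 1 or 2),
 * or -1 if it cannot be read, e.g. on a non-Linux system.
 */
int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```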

best regards,
Florian Pflug

#8 Roger Leigh
rleigh@codelibre.net
In reply to: Florian Pflug (#2)
Re: POSIX question

On Mon, Jun 20, 2011 at 04:16:58PM +0200, Florian Pflug wrote:

I think this breaks with strict overcommit settings (i.e. vm.overcommit_memory = 2 on Linux). To fix that, you'd need a way to tell the kernel (or glibc) to simply reserve a chunk of virtual address space for further use. Not sure if there's an API for that...

I run diskless, swapless cluster systems with zero overcommit (i.e.
it's entirely disabled), which means that all operations are
strict success/fail due to allocation being immediate. mmap of a
large amount of anonymous memory would almost certainly fail on
such a setup--you definitely can't assume that a large anonymous
mmap will always succeed, since there is no delayed allocation.

[we do in reality have a small overcommit allowance to permit
efficient fork(2), but it's tiny and (in this context) irrelevant]

Regards,
Roger

--
.''`. Roger Leigh
: :' : Debian GNU/Linux http://people.debian.org/~rleigh/
`. `' Printing on GNU/Linux? http://gutenprint.sourceforge.net/
`- GPG Public Key: 0x25BFB848 Please GPG sign your mail.

#9 Andres Freund
andres@anarazel.de
In reply to: Radosław Smogura (#6)
Re: POSIX question

On Monday, June 20, 2011 17:05:48 Radosław Smogura wrote:

I'm 99% sure. When I was "playing" with mmap I preallocated, probably,
about 100GB of memory.

The default setting is to allow overcommit.

Andres

#10 Greg Stark
stark@mit.edu
In reply to: Florian Pflug (#5)
Re: POSIX question

On Mon, Jun 20, 2011 at 4:01 PM, Florian Pflug <fgp@phlo.org> wrote:

Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?

Or at least this is what I always thought glibc does when you malloc()

It mmaps /dev/zero actually.

--
greg

#11 Radosław Smogura
rsmogura@softperience.eu
In reply to: Florian Pflug (#7)
Re: POSIX question

Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 17:07:55

On Jun20, 2011, at 17:05 , Radosław Smogura wrote:

I'm 99% sure. When I was "playing" with mmap I preallocated,
probably, about 100GB of memory.

You need to set vm.overcommit_memory to "2" to see the difference. Did
you do that?

You can do that either with "echo 2 > /proc/sys/vm/overcommit_memory"
or by editing /etc/sysctl.conf and issuing "sysctl -p".

best regards,
Florian Pflug

I've just created a 127TB mapping on Linux (the maximum allowed by the VM),
trying overcommit settings 0, 1 and 2.

Regards,
Radek

#12 Andres Freund
andres@anarazel.de
In reply to: Greg Stark (#10)
Re: POSIX question

On Monday, June 20, 2011 17:11:14 Greg Stark wrote:

On Mon, Jun 20, 2011 at 4:01 PM, Florian Pflug <fgp@phlo.org> wrote:

Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?

Or at least this is what I always thought glibc does when you malloc()

It mmaps /dev/zero actually.

As the nitpicking has already started: AFAIR it's just passing -1 as fd and
using the MAP_ANONYMOUS flag argument ;)
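
For comparison, the two variants discussed here can be written side by side. This is a sketch with hypothetical function names; both calls hand back zero-filled private memory, and it makes no claim about which one a particular glibc version uses internally.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

/* Older style: map /dev/zero privately to get zero-filled memory. */
void *map_dev_zero(size_t len)
{
    int   fd = open("/dev/zero", O_RDWR);
    void *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                   /* the mapping stays valid after close() */
    return p == MAP_FAILED ? NULL : p;
}

/* Same effect without a file: fd is -1 and MAP_ANONYMOUS is set. */
void *map_anonymous(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```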

Andres

#13 Markus Wanner
markus@bluegap.ch
In reply to: Radosław Smogura (#1)
Re: POSIX question

Radek,

On 06/20/2011 03:27 PM, Radosław Smogura wrote:

While playing with mmap I worked out an approach for dealing with growing
files, so...

Your approach seems to require a SysV alloc (for nattach) as well as
POSIX shmem and/or mmap. Adding requirements for these syscalls
certainly needs to give a good benefit for Postgres, as they presumably
pose portability issues.

3. a. Lock header when adding chunks (1st chunk is header) (we don't
want concurrent chunk allocation)

Sure we don't? There are at least a dozen memory allocators for
multi-threaded applications, all trying to optimize for concurrency.
The programmer of a multi-threaded application doesn't need to care much
about concurrent allocations. He can allocate (and free) quite a lot of
tiny chunks concurrently from shared memory.

Regards

Markus Wanner