configurability of OOM killer
This page
http://linux-mm.org/OOM_Killer
says that you can hint the OOM killer to be more deferential towards
certain processes.
I am wondering if we can set the system up so that it skips the postmaster,
bgwriter etc., and prefers to kill normal backends instead (though we
would still try to give them fewer points than other regular
processes). That could make the system more robust overall, even if the
sysadmin hasn't configured it.
Incidentally, the same page notes that points are subtracted from
processes with raw I/O capability, which means Oracle is probably
avoiding this problem altogether.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
This page
http://linux-mm.org/OOM_Killer
Egad. Whoever thought *this* was a good idea should be taken out
and shot:
The independent memory size of any child (except a kernel thread) is added to the score:
/*
* Processes which fork a lot of child processes are likely
* a good choice. We add the vmsize of the childs if they
* have an own mm. This prevents forking servers to flood the
* machine with an endless amount of childs
*/
In other words, server daemons are preferentially killed, and the parent
will *always* get zapped in place of its child (since the child cannot
have a higher score). No wonder we have to turn off OOM kill.
regards, tom lane
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
This page
http://linux-mm.org/OOM_Killer
Egad. Whoever thought *this* was a good idea should be taken out
and shot:
The independent memory size of any child (except a kernel thread) is added to the score:
/*
* Processes which fork a lot of child processes are likely
* a good choice. We add the vmsize of the childs if they
* have an own mm. This prevents forking servers to flood the
* machine with an endless amount of childs
*/
In other words, server daemons are preferentially killed, and the parent
will *always* get zapped in place of its child (since the child cannot
have a higher score). No wonder we have to turn off OOM kill.
That was pretty much my reaction.
And it looks like you can't turn it off for postgres processes because
that works by process group and we call setsid(), so we aren't in a
single process group.
cheers
andrew
* Alvaro Herrera:
I am wondering if we can set the system up so that it skips postmaster,
bgwriter etc, and feels more preference towards normal backends (but
then, we would try to give them less points than other regular
processes). That could make the system more robust overall, even if the
sysadmin hasn't configured it.
How much does that help? Postmaster &c still need to be shut down
when a regular backend dies due to SIGKILL.
--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
Florian Weimer <fweimer@bfk.de> writes:
* Alvaro Herrera:
I am wondering if we can set the system up so that it skips postmaster,
How much does that help? Postmaster &c still need to be shut down
when a regular backend dies due to SIGKILL.
The $64 problem is that if the parent postmaster process is victimized
by the OOM killer, you won't get an automatic restart. In most people's
eyes that is considerably worse than the momentary DOS imposed by a kill
of a child backend. And what we now find, which is truly staggeringly
stupid on the kernel's part, is that it *preferentially* kills the
parent instead of whatever child might actually be eating the memory.
regards, tom lane
* Tom Lane:
How much does that help? Postmaster &c still need to be shut down
when a regular backend dies due to SIGKILL.
The $64 problem is that if the parent postmaster process is victimized
by the OOM killer, you won't get an automatic restart.
The classic answer to that is to put it into inittab. 8-/
In most people's eyes that is considerably worse than the momentary
DOS imposed by a kill of a child backend. And what we now find,
which is truly staggeringly stupid on the kernel's part, is that it
*preferentially* kills the parent instead of whatever child might
actually be eating the memory.
IIRC, the idea is to get the machine out of OOM land with one killed
process, even if it causes dependent processes to fail. No matter
what you do at this point, you lose. If you prefer the child instead
of the parent, the parent might just reattempt the fork() (which
succeeds thanks to COW), and the child runs into the same OOM
condition.
--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
Florian Weimer <fweimer@bfk.de> writes:
* Tom Lane:
The $64 problem is that if the parent postmaster process is victimized
by the OOM killer, you won't get an automatic restart.
The classic answer to that is to put it into inittab. 8-/
Except that no standard services are actually run that way, for
sundry good-n-sufficient reasons.
In most people's eyes that is considerably worse than the momentary
DOS imposed by a kill of a child backend. And what we now find,
which is truly staggeringly stupid on the kernel's part, is that it
*preferentially* kills the parent instead of whatever child might
actually be eating the memory.
IIRC, the idea is to get the machine out of OOM land with one killed
process, even if it causes dependent processes to fail.
You're just parroting the reasoning given on the cited webpage, which
is loony because it takes no account whatsoever of actual practice.
Postgres is hardly the only daemon for which killing the parent results
in far worse DOS than not doing so. sendmail, sshd, inetd, and mysqld
are examples that come to mind immediately, and I am fairly sure that
it's true for apache as well.
Also, how is killing parent and child less invasive than killing only
the child (which is the one actually eating memory, in these cases)?
The reasoning isn't even self-consistent.
No matter what you do at this point, you lose.
Well, since the authors of the overcommit logic appear uninterested
in running stable userland services, turning it off is the only way
not to lose.
regards, tom lane
Tom Lane wrote:
Florian Weimer <fweimer@bfk.de> writes:
* Alvaro Herrera:
I am wondering if we can set the system up so that it skips postmaster,
How much does that help? Postmaster &c still need to be shut down
when a regular backend dies due to SIGKILL.
The $64 problem is that if the parent postmaster process is victimized
by the OOM killer, you won't get an automatic restart. In most people's
eyes that is considerably worse than the momentary DOS imposed by a kill
of a child backend. And what we now find, which is truly staggeringly
stupid on the kernel's part, is that it *preferentially* kills the
parent instead of whatever child might actually be eating the memory.
Maybe we should react equally brute-force, and simply disable the
OOM killer for the postmaster if we're running on Linux. It seems that
something like "echo -17 > /proc/<pid>/oom_adj" should do the trick.
And maybe add a note to the docs telling people to disable memory
overcommit on dedicated database servers if that isn't already there...
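For reference, the one-liner above could equally be done programmatically. Here is a minimal sketch of such a helper; the `proc_root` parameter is an invention of this sketch (real use hardcodes /proc and requires root), and note that newer kernels replaced `oom_adj` with `oom_score_adj`:

```python
import os

def protect_from_oom_killer(pid, adjustment=-17, proc_root="/proc"):
    """Programmatic equivalent of `echo -17 > /proc/<pid>/oom_adj`.

    On 2.6-era kernels, -17 (OOM_DISABLE) exempts the process from the
    OOM killer entirely. `proc_root` exists only so the function can be
    exercised against a fake /proc tree; in real use this needs root.
    """
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as f:
        f.write("%d\n" % adjustment)
```

Modern kernels expose the same knob as `/proc/<pid>/oom_score_adj` with a range of -1000..1000, where -1000 is the fully-exempt value.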
regards, Florian Pflug
* Tom Lane:
IIRC, the idea is to get the machine out of OOM land with one killed
process, even if it causes dependent processes to fail.
You're just parroting the reasoning given on the cited webpage, which
is loony because it takes no account whatsoever of actual practice.
Oops, I hadn't actually read it (I can't reach the Web from this
terminal).
Postgres is hardly the only daemon for which killing the parent results
in far worse DOS than not doing so. sendmail, sshd, inetd, and mysqld
are examples that come to mind immediately, and I am fairly sure that
it's true for apache as well.
Historically, the OOM killer was mainly there to avoid a total lock-up
or straight reboot on single-user machines with text-mode console and
the occasional broken shell script. For example, it used to kill the
X server, too. Anyway, a dead SSH session or database server is less
of a DoS than a lock-up due to the OOM killer's inability to recover
resources in a reasonable time frame. (I'd need to check if it
prefers killing the main sshd daemon. That would be rather
inconvenient.)
And let me repeat: If some shell script à la
for x in *; do foo $x; done
is causing the trouble, you need to kill the parent (the shell) to
bring the system back. Killing foo brings only very short-term
relief.
Fortunately, it's possible to turn off overcommitment nowadays, so
it's not such a huge issue anymore (for me, at least). Some
applications are still not fully compatible with this mode (SBCL, for
instance, and the Sun JVM doesn't perform as well as it could,
either), but there are astonishingly few problems with
vm.overcommit_memory=2.
--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
Florian G. Pflug wrote:
Maybe we should just react equally brute-force, and just disable the
OOM-Killer for the postmaster if we're running on linux. It seems that
something like "echo -17 > /proc/<pid>/oom_adj" should do the trick.
That will protect the postmaster but none of the children. And it will
be very fragile, as only root can do it.
And maybe add a note to the docs telling people to disable memory
overcommit on dedicated database servers if that isn't already there...
It is there, and has been for years.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Florian G. Pflug wrote:
Maybe we should just react equally brute-force, and just disable the
OOM-Killer for the postmaster if we're running on linux. It seems that
something like "echo -17 > /proc/<pid>/oom_adj" should do the trick.
That will protect the postmaster but none of the children. And it will
be very fragile, as only root can do it.
However, init-scripts do run as root, so this is something that the RPM
packages could theoretically do. I wonder whether it would be seen as
good packaging practice ;-)
Not protecting the children is probably sane, since it's perfectly
possible for one of them to blow up memory-wise. If you're going
to protect them then there's little point in enabling the OOM killer
at all.
And maybe add a note to the docs telling people to disable memory
overcommit on dedicated database servers if that isn't already there...
It is there, and has been for years.
Another thought is to tell people to run the postmaster under a
per-process memory ulimit that is conservative enough so that the
system can't get into the regime where the OOM killer activates.
ulimit actually behaves the way we want, ie, it's polite about
telling you you can't have more memory ;-).
The problem with that is that the DBA has to do the math about what he
can afford as a per-process ulimit, and it seems a fairly error-prone
calculation. Is there any way we could automate it, in whole or
in part? We are certainly capable of setting the ulimit ourselves
if we can figure out what it should be.
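As a back-of-the-envelope illustration of what that calculation might look like, here is a hypothetical heuristic; the function name, the parameters, and the 20% OS reserve are all invented for this sketch, not anything Postgres actually implements:

```python
def suggest_backend_ulimit(total_ram, shared_buffers, max_connections,
                           reserve_fraction=0.2):
    """Hypothetical heuristic for a conservative per-backend memory
    ulimit: take what is left of RAM after an OS reserve and the shared
    memory segment, and divide it evenly among the maximum number of
    backends. All values are in bytes.
    """
    available = total_ram * (1 - reserve_fraction) - shared_buffers
    if available <= 0:
        raise ValueError("shared_buffers leaves no headroom for backends")
    return int(available / max_connections)
```

The error-prone part Tom alludes to is visible even here: the right answer depends on work_mem, maintenance_work_mem, the number of simultaneously active (rather than merely open) backends, and what else runs on the box, none of which this toy formula captures.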
regards, tom lane
Tom Lane wrote:
Another thought is to tell people to run the postmaster under a
per-process memory ulimit that is conservative enough so that the
system can't get into the regime where the OOM killer activates.
ulimit actually behaves the way we want, ie, it's polite about
telling you you can't have more memory ;-).
That will only work if postgres is the only service running on the
machine, though, no? If the postmaster and its children use up 80% of
the available memory, then launching a forkbomb will still lead to the
postmaster being killed (since it will get the most points). Or at least
that is how I interpret the link posted originally.
And *if* postgres is the only service, does setting a ulimit have an
advantage over disabling memory overcommitting?
AFAICS, memory overcommit helps if a program creates 50mb of mostly
read-only data and then forks 10 times, or if it maps a large amount of
memory but writes to that block only sparsely. Since postgres does
neither, a dedicated postgres server won't see any benefits from
overcommitting memory, I'd think.
regards, Florian Pflug
Hi,
On Saturday 02 February 2008 20:39:15, Florian Weimer wrote:
Oops, I hadn't actually read it (I can't reach the Web from this
terminal).
A friend of mine was in the same situation you seem to be in, so I
implemented a mail bot for him that gives some access to documents on the web:
http://mbot.nah-ko.org/
http://packages.debian.org/mbot
If you have a "friendly" mail server machine where to host the mbot software,
you then gain back the ability to read online stuff from any mail-only
terminal setup.
Please note that I haven't worked on this software for a long time now
(several *years*) and am not planning to any time soon... and it doesn't
relay POST requests, only GET ones, at the moment.
Regards, hope this helps,
--
dim
On Sat, Feb 02, 2008 at 09:49:05PM +0100, Florian G. Pflug wrote:
AFAICS, memory overcommit helps if a program creates 50mb of mostly
read-only data and then forks 10 times, or if it maps a large amount of
memory but writes to that block only sparsely. Since postgres does
neither, a dedicated postgres server won't see any benefits from
overcommitting memory I'd think.
While this was probably intended to be funny, postgres does in fact
load 10mb of mostly read-only data (the
binary/libc/ssl/locales/kerberos add up to about 10mb on my machine);
it subsequently forks a dozen times, once for each connection. So postgres
is *exactly* such a program. If you start preloading
plperl/plpython/etc it grows even faster.
Now, postgres almost certainly will never change much of it, so it's not
a big deal, but it could if it wanted to, and that's what overcommit was
designed for: banking on the fact that 99% of the time, that space
isn't written to. Overcommit is precisely what makes forking as cheap
as threads.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Those who make peaceful revolution impossible will make violent revolution inevitable.
-- John F Kennedy
Martijn van Oosterhout wrote:
On Sat, Feb 02, 2008 at 09:49:05PM +0100, Florian G. Pflug wrote:
AFAICS, memory overcommit helps if a program creates 50mb of mostly
read-only data and then forks 10 times, or if it maps a large amount of
memory but writes to that block only sparsely. Since postgres does
neither, a dedicated postgres server won't see any benefits from
overcommitting memory, I'd think.
While this was probably intended to be funny, postgres does in fact
load 10mb of mostly read-only data (the
binary/libc/ssl/locales/kerberos add up to about 10mb on my machine) it
subsequently forks a dozen times, one for each connection. So postgres
is *exactly* such a program. If you start preloading
plperl/plpython/etc it grows even faster.
Now, postgres almost certainly will never change much of it so it's not
a big deal, but it could if it wanted to and that what overcommit was
designed for: banking on the fact that 99% of the time, that space
isn't written to. Overcommit is precisely what makes forking as cheap
as threads.
1. Isn't most of that space program text in segments marked read-only?
2. I always turn on strict memory accounting on Linux. I haven't noticed
that it has had any performance effect. But it does pretty much do away
with the likelihood of having postgres killed from under me, AFAIK.
cheers
andrew
"Martijn van Oosterhout" <kleptog@svana.org> writes:
On Sat, Feb 02, 2008 at 09:49:05PM +0100, Florian G. Pflug wrote:
AFAICS, memory overcommit helps if a program creates 50mb of mostly
read-only data and then forks 10 times, or if it maps a large amount of
memory but writes to that block only sparsely. Since postgres does
neither, a dedicated postgres server won't see any benefits from
overcommitting memory, I'd think.
While this was probably intended to be funny, postgres does in fact
load 10mb of mostly read-only data (the
binary/libc/ssl/locales/kerberos add up to about 10mb on my machine) it
subsequently forks a dozen times, one for each connection. So postgres
is *exactly* such a program. If you start preloading
plperl/plpython/etc it grows even faster.
Now, postgres almost certainly will never change much of it so it's not
a big deal, but it could if it wanted to
Actually no; at least on Linux the dynamic linker maps shared libraries
read-only, so it really can't. Not without changing the mapping, at which
point the kernel could adjust its memory counts.
However, the reference to plperl and plpython is more apt. At least with perl
under Apache it's quite common to arrange to load as many modules as possible
before forking. That way the worker processes have shared copies of those
modules which, even though they're most certainly in writable memory, are
mostly kept shared.
The real screw case that overcommit is intended for is actually large programs
-- like postgres -- which call fork/exec small programs often. So for example
if you have postgres calling system() it should be allowed to do the fork even
if there aren't many megabytes free because it's only going to exec some small
program like pg_standby. This is especially nasty when you realize that bash
itself is one such large program...
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
Martijn van Oosterhout <kleptog@svana.org> writes:
Now, postgres almost certainly will never change much of it so it's not
a big deal, but it could if it wanted to and that what overcommit was
designed for: banking on the fact that 99% of the time, that space
isn't written to. Overcommit is precisely what makes forking as cheap
as threads.
Nonsense. Copy-on-write is what makes forking as cheap as threads.
Now it's true that strict accounting requires the kernel to be prepared
to make a lot of page copies that it will never actually need in
practice. In my mind that's what swap space is for: it's the buffer
that the kernel *would* need if there were suddenly a lot more
copies-on-write than it'd been expecting.
As already noted, code pages are generally read-only and need not factor
into the calculation at all. I'm not sure how much potentially-writable
storage is really forked off by the postmaster, but I doubt it's in the
tens-of-MB range.
regards, tom lane
On Fri, 2008-02-01 at 19:08 -0500, Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
This page
http://linux-mm.org/OOM_Killer
Egad. Whoever thought *this* was a good idea should be taken out
and shot:
The independent memory size of any child (except a kernel thread) is added to the score:
/*
* Processes which fork a lot of child processes are likely
* a good choice. We add the vmsize of the childs if they
* have an own mm. This prevents forking servers to flood the
* machine with an endless amount of childs
*/
In other words, server daemons are preferentially killed, and the parent
will *always* get zapped in place of its child (since the child cannot
have a higher score). No wonder we have to turn off OOM kill.
This does look bad.
I think we should fix this problem, though I see the problem as being
Postgres not being able to set and adhere to memory limits. The OS
doesn't favour us on this point, but I think we will be ignored when we
have to explain that we don't strictly control the memory we allocate
and use.
I would like there to be a way for us to say "the server is limited to
using at most X amount of memory". There might be various ways of doing
it, but I'd like us to agree on that as an important goal for 8.4 development.
The benefit of doing this is that we won't have to allocate a certain
percentage of memory as contingency to avoid swapping and OOM killers.
So putting in place a memory limit will effectively increase the
available memory the server has access to and/or limit swapping, either
of which will be a performance increase.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
... OOM_Killer
Egad. Whoever thought *this* was a good idea should be taken out
and shot:
If I read this right (http://lkml.org/lkml/2007/2/9/275), even the
shared memory is counted many times (once per child) for the
parent process, even though it's (obviously) not copy-on-write,
so the shared memory is unlikely to contribute to problems.
I wonder if postgres startup should write something (warning?
at least log?) in the log file if the OOM killer is enabled. I assume
most people who care deeply about their database dying would notice a
warning in log files; while most people who don't mind the OOM killer
also wouldn't be too bothered by extra noise in the file.
On Fri, 2008-02-01 at 19:08 -0500, Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
This page
http://linux-mm.org/OOM_Killer
Egad. Whoever thought *this* was a good idea should be taken out
and shot:
+1
/*
* Processes which fork a lot of child processes are likely
* a good choice. We add the vmsize of the childs if they
* have an own mm. This prevents forking servers to flood the
* machine with an endless amount of childs
*/
In other words, server daemons are preferentially killed, and the parent
will *always* get zapped in place of its child (since the child cannot
have a higher score). No wonder we have to turn off OOM kill.
Technically, the child could have a higher score, because it only counts
half of the total vm size of the children. At first glance it's not that
bad of an idea, except that it takes into account the total vm size
(including shared memory), not only memory that is exclusive to the
process in question.
It's pretty easy to see that badness() (the function that determines
which process is killed when the OOM killer is invoked) will count the
same byte of memory many times over when calculating the "badness" of a
process like the postgres daemon. If you have shared_buffers=1GB on a
4GB box, and 100 connections open, badness() apparently thinks
postgresql is using about 50GB of memory. Oops. One would think a VM
hacker would know better.
I tried bringing this up on LKML several times (Ron Mayer linked to one
of my posts: http://lkml.org/lkml/2007/2/9/275). If anyone has an inside
connection to the linux developer community, I suggest that they raise
this issue.
If you want to experiment, start a postgres process with shared_buffers
set at 25% of the available memory, and then start about 100 idle
connections. Then, start a process that just slowly eats memory, such
that it will invoke the OOM killer after a couple minutes (badness()
takes into account the time the process has been alive, as well, so you
can't just eat memory in a tight loop).
The postgres process will always be killed first; then the kernel will
realize that this didn't alleviate the memory pressure much, and will
kill the runaway process.
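Jeff's gradual memory eater could be sketched along these lines; the `max_chunks` parameter is an addition of this sketch so it can be exercised safely, and in a real experiment you would omit it and let the kernel intervene:

```python
import time

def slow_memory_eater(chunk_mb=10, delay_s=1.0, max_chunks=None):
    """Allocate memory gradually, so the process accumulates enough
    runtime for badness()'s age factor not to shield it. Returns the
    total bytes allocated; the return is only reachable when max_chunks
    is set, otherwise allocation continues until the OOM killer fires.
    """
    chunks = []
    while max_chunks is None or len(chunks) < max_chunks:
        # bytearray zero-fills its buffer, so these pages are actually
        # written to and committed, not merely reserved.
        chunks.append(bytearray(chunk_mb * 1024 * 1024))
        time.sleep(delay_s)
    return len(chunks) * chunk_mb * 1024 * 1024
```

Run it as an unprivileged process alongside the idle-connection setup described above to watch which victim badness() picks first.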
Regards,
Jeff Davis