/proc/self/oom_adj is deprecated in newer Linux kernels
While testing 9.1 RPMs on Fedora 15 (2.6.40 kernel), I notice
messages like these in the kernel log:
Sep 11 13:38:56 rhl kernel: [ 415.308092] postgres (18040): /proc/18040/oom_adj is deprecated, please use /proc/18040/oom_score_adj instead.
These don't show up on every single PG process launch, but that probably
just indicates there's a rate-limiter in the kernel reporting mechanism.
So it looks like it behooves us to cater for oom_score_adj in the
future. The simplest, least risky change that I can think of is to
copy-and-paste the relevant #ifdef code block in fork_process.c.
If we do that, then it would be up to the packager whether to #define
LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior
he wants.
That would be good enough for my own purposes in building Fedora/RHEL
packages, since I can predict with confidence which kernel versions a
given build is likely to be used with. I think probably the same
would be true for most other distro-specific builds. Does anyone want
to argue for doing something more complicated, and if so what exactly?
regards, tom lane
On Fri, Sep 16, 2011 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Does anyone want
to argue for doing something more complicated, and if so what exactly?
Well there's no harm trying to write to oom_score_adj and if that
fails with EEXISTS trying to write to oom_adj.
--
greg
Greg Stark <stark@mit.edu> writes:
On Fri, Sep 16, 2011 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Does anyone want
to argue for doing something more complicated, and if so what exactly?
Well there's no harm trying to write to oom_score_adj and if that
fails with EEXISTS trying to write to oom_adj.
Well, (1) what value are you going to write (they need to be different
for the two files), and (2) the main point of the exercise, at present,
is to avoid kernel log messages. I'm not sure that trying to create
random files under /proc isn't going to draw bleats in the kernel log.
regards, tom lane
Excerpts from Tom Lane's message of vie sep 16 13:37:46 -0300 2011:
Greg Stark <stark@mit.edu> writes:
On Fri, Sep 16, 2011 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Does anyone want
to argue for doing something more complicated, and if so what exactly?Well there's no harm trying to write to oom_score_adj and if that
fails with EEXISTS trying to write to oom_adj.Well, (1) what value are you going to write (they need to be different
for the two files), and (2) the main point of the exercise, at present,
is to avoid kernel log messages. I'm not sure that trying to create
random files under /proc isn't going to draw bleats in the kernel log.
I guess the question is what semantics the new code has. In the old
badness() world, child processes inherited the oom_adj value of its
parent. What the code in fork_process was used for was resetting the
value back to 0 (meaning "kernel is allowed to kill this process on
OOM"), so that you could set the oom_adj in the start script for
postmaster (to a value meaning "never kill this one"), and the backends
would see their values reset to zero.
The new oom_score_adj has a scale of -1000 to +1000, with zero being
neutral and -1000 being "never kill". So what we want to do here in
most cases, it seems, is set the value to zero whether it's oom_adj or
oom_score_adj -- assuming the new code is still causing children
processes to inherit the "adj" value from the parent.
Now the problem is that we have defined the LINUX_OOM_ADJ symbol as
meaning the value we're going to write. Maybe this wasn't the best
choice. I mean, it's very flexible, but it doesn't seem to offer any
benefit over a plain boolean choice.
Is your proposal to create a new LINUX_OOM_SCORE_ADJ cpp symbol with the
same semantics?
The most thorough documentation on this option seems to be this commit:
https://github.com/mirrors/linux-2.6/commit/a63d83f427fbce97a6cea0db2e64b0eb8435cd10#include/linux/oom.h
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
Now the problem is that we have defined the LINUX_OOM_ADJ symbol as
meaning the value we're going to write. Maybe this wasn't the best
choice. I mean, it's very flexible, but it doesn't seem to offer any
benefit over a plain boolean choice.
Is your proposal to create a new LINUX_OOM_SCORE_ADJ cpp symbol with the
same semantics?
Yes, that's what I was thinking. We could avoid that if we were going
to hard-wire a decision that zero is the thing to write, but I see no
reason to place such a restriction on users. Who's to say they might
not want backends to adopt some other value?
regards, tom lane
On Fri, Sep 16, 2011 at 10:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Stark <stark@mit.edu> writes:
On Fri, Sep 16, 2011 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Does anyone want
to argue for doing something more complicated, and if so what exactly?Well there's no harm trying to write to oom_score_adj and if that
fails with EEXISTS trying to write to oom_adj.
Yeah, I don't really like the idea of a compile time option that is
kernel version dependent... But i don't feel too strongly about it
either (all my kernels are new enough that they support
oom_score_adj).
I'll also note that on my system we are in the good company of ssd and chromium:
sshd (978): /proc/978/oom_adj is deprecated, please use
/proc/978/oom_score_adj instead.
chromium-sandbo (1377): /proc/1375/oom_adj is deprecated, please use
/proc/1375/oom_score_adj instead.
[ It quite annoying that soon after we decided to stick
-DLINUX_OOM_ADJ in they changed it. Whatever happened to a stable
userspace API :-( ]
On fre, 2011-09-16 at 10:57 -0400, Tom Lane wrote:
So it looks like it behooves us to cater for oom_score_adj in the
future. The simplest, least risky change that I can think of is to
copy-and-paste the relevant #ifdef code block in fork_process.c.
If we do that, then it would be up to the packager whether to #define
LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior
he wants.
There are lots of reasons why that won't work: backports, forward ports,
derivatives, custom kernels, distribution upgrades, virtual hosting.
ISTM that we could at least query the currently used kernel version.
Peter Eisentraut <peter_e@gmx.net> writes:
On fre, 2011-09-16 at 10:57 -0400, Tom Lane wrote:
So it looks like it behooves us to cater for oom_score_adj in the
future. The simplest, least risky change that I can think of is to
copy-and-paste the relevant #ifdef code block in fork_process.c.
If we do that, then it would be up to the packager whether to #define
LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior
he wants.
There are lots of reasons why that won't work: backports, forward ports,
derivatives, custom kernels, distribution upgrades, virtual hosting.
[ shrug... ] These are all hypothetical reasons. A packager who
foresaw needing that could just turn on both write attempts, or for that
matter patch the code to do whatever else he saw fit. In practice,
we've not had requests for anything significantly smarter than what is
there.
But having said that, it wouldn't be very hard to arrange things so that
if you did have both symbols defined, the code would only attempt to
write oom_adj if it had failed to write oom_score_adj; which is about as
close as you're likely to get to a kernel version test for this.
regards, tom lane
On sön, 2011-09-18 at 12:21 -0400, Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
On fre, 2011-09-16 at 10:57 -0400, Tom Lane wrote:
So it looks like it behooves us to cater for oom_score_adj in the
future. The simplest, least risky change that I can think of is to
copy-and-paste the relevant #ifdef code block in fork_process.c.
If we do that, then it would be up to the packager whether to #define
LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior
he wants.There are lots of reasons why that won't work: backports, forward ports,
derivatives, custom kernels, distribution upgrades, virtual hosting.[ shrug... ] These are all hypothetical reasons. A packager who
foresaw needing that could just turn on both write attempts, or for that
matter patch the code to do whatever else he saw fit. In practice,
we've not had requests for anything significantly smarter than what is
there.But having said that, it wouldn't be very hard to arrange things so that
if you did have both symbols defined, the code would only attempt to
write oom_adj if it had failed to write oom_score_adj; which is about as
close as you're likely to get to a kernel version test for this.
Why is this feature not a run-time configuration variable or at least a
configure option? It's awfully well hidden now. I doubt a lot of
people are using this even though they might wish to.
Peter Eisentraut <peter_e@gmx.net> writes:
On sön, 2011-09-18 at 12:21 -0400, Tom Lane wrote:
But having said that, it wouldn't be very hard to arrange things so that
if you did have both symbols defined, the code would only attempt to
write oom_adj if it had failed to write oom_score_adj; which is about as
close as you're likely to get to a kernel version test for this.
Why is this feature not a run-time configuration variable or at least a
configure option? It's awfully well hidden now. I doubt a lot of
people are using this even though they might wish to.
See the thread in which the feature was designed originally:
http://archives.postgresql.org/pgsql-hackers/2010-01/msg00170.php
The key point is that to get useful behavior, you need cooperation
between both a root-privileged startup script and the PG executable.
That tends to throw the problem into the domain of packagers, more
than end users, and definitely puts a big crimp in the idea that
run-time configuration of just half of the behavior could be helpful.
So far, no Linux packagers have complained that the design is inadequate
(a position that I also hold when wearing my red fedora) so I do not
feel a need to complicate it further.
regards, tom lane
This is one way to prevent the kernel warning message without having to
introduce a new constant. Scale the old oom_adj-style value the same way
the kernel internally does it and use that instead if oom_score_adj is
available for writing.
Signed-off-by: Dan McGee <dan@archlinux.org>
---
This addresses some of the concerns raised on the ML and will at least keep
those of us running modern kernels happy.
Alternatively one could switch the symbol used to be the new style and have the
old one computed; however this is a pain for legacy vs. current versions.
src/backend/postmaster/fork_process.c | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)
diff --git a/src/backend/postmaster/fork_process.c b/src/backend/postmaster/fork_process.c
index b2fe9a1..3cded54 100644
--- a/src/backend/postmaster/fork_process.c
+++ b/src/backend/postmaster/fork_process.c
@@ -81,16 +81,36 @@ fork_process(void)
* Use open() not stdio, to ensure we control the open flags. Some
* Linux security environments reject anything but O_WRONLY.
*/
- int fd = open("/proc/self/oom_adj", O_WRONLY, 0);
+ int fd = open("/proc/self/oom_score_adj", O_WRONLY, 0);
/* We ignore all errors */
if (fd >= 0)
{
char buf[16];
+ int oom_score_adj;
+ /*
+ * The compile-time value is the old style oom_adj;
+ * scale it the same way the kernel does to
+ * convert to the new style oom_score_adj. This
+ * should become a constant at compile time.
+ * Valid values range from -17 (never kill) to
+ * 15; no attempt of validation is done.
+ */
+ oom_score_adj = LINUX_OOM_ADJ * 1000 / 17;
snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ);
(void) write(fd, buf, strlen(buf));
close(fd);
+ } else if (errno == EEXIST) {
+ int fd = open("/proc/self/oom_adj", O_WRONLY, 0);
+ if (fd >= 0)
+ {
+ char buf[16];
+
+ snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ);
+ (void) write(fd, buf, strlen(buf));
+ close(fd);
+ }
}
}
#endif /* LINUX_OOM_ADJ */
--
1.7.6.1
On Mon, Sep 19, 2011 at 3:11 PM, Dan McGee <dan@archlinux.org> wrote:
This is one way to prevent the kernel warning message without having to
introduce a new constant. Scale the old oom_adj-style value the same way
the kernel internally does it and use that instead if oom_score_adj is
available for writing.Signed-off-by: Dan McGee <dan@archlinux.org>
---This addresses some of the concerns raised on the ML and will at least keep
those of us running modern kernels happy.Alternatively one could switch the symbol used to be the new style and have the
old one computed; however this is a pain for legacy vs. current versions.src/backend/postmaster/fork_process.c | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)diff --git a/src/backend/postmaster/fork_process.c b/src/backend/postmaster/fork_process.c index b2fe9a1..3cded54 100644 --- a/src/backend/postmaster/fork_process.c +++ b/src/backend/postmaster/fork_process.c @@ -81,16 +81,36 @@ fork_process(void) * Use open() not stdio, to ensure we control the open flags. Some * Linux security environments reject anything but O_WRONLY. */ - int fd = open("/proc/self/oom_adj", O_WRONLY, 0); + int fd = open("/proc/self/oom_score_adj", O_WRONLY, 0);/* We ignore all errors */
if (fd >= 0)
{
char buf[16];
+ int oom_score_adj;+ /* + * The compile-time value is the old style oom_adj; + * scale it the same way the kernel does to + * convert to the new style oom_score_adj. This + * should become a constant at compile time. + * Valid values range from -17 (never kill) to + * 15; no attempt of validation is done. + */ + oom_score_adj = LINUX_OOM_ADJ * 1000 / 17; snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ);
Of course it would help to actually use the computed value here; this should be:
snprintf(buf, sizeof(buf), "%d\n",
oom_score_adj);
Show quoted text
(void) write(fd, buf, strlen(buf)); close(fd); + } else if (errno == EEXIST) { + int fd = open("/proc/self/oom_adj", O_WRONLY, 0); + if (fd >= 0) + { + char buf[16]; + + snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ); + (void) write(fd, buf, strlen(buf)); + close(fd); + } } } #endif /* LINUX_OOM_ADJ */ -- 1.7.6.1
On Mon, Sep 19, 2011 at 4:36 PM, Dan McGee <dan@archlinux.org> wrote:
[ patch ]
I suppose it's Tom who really needs to comment on this, but I'm not
too enthusiastic about this approach. Duplicating the Linux kernel
calculation into our code means that we could drift if the formula
changes again.
I like Tom's previous suggestion (I think) of allowing both constants
to be defined - if they are, then we try oom_score_adj first and fall
back to oom_adj if that fails. For bonus points, we could have
postmaster stat() its own oom_score_adj file before forking and set a
global variable to indicate the results. That way we'd only ever need
to test once per postmaster startup (at least until someone figures
out a way to swap out the running kernel without stopping the
server...!).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 23, 2011 at 2:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Sep 19, 2011 at 4:36 PM, Dan McGee <dan@archlinux.org> wrote:
[ patch ]
I suppose it's Tom who really needs to comment on this, but I'm not
too enthusiastic about this approach. Duplicating the Linux kernel
calculation into our code means that we could drift if the formula
changes again.
Why would the formula ever change? This seems like a different excuse
way of really saying you don't appreciate the hacky approach, which I
can understand completely. However, it simply doesn't make sense for
them to change this formula, as it scales the -17 to 16 old range to
the new -1000 to 1000 range. Those endpoints won't be changing unless
a third method is introduced, in which case this whole thing is mute
and we'd need to fix it yet again.
I like Tom's previous suggestion (I think) of allowing both constants
to be defined - if they are, then we try oom_score_adj first and fall
back to oom_adj if that fails. For bonus points, we could have
postmaster stat() its own oom_score_adj file before forking and set a
global variable to indicate the results. That way we'd only ever need
to test once per postmaster startup (at least until someone figures
out a way to swap out the running kernel without stopping the
server...!).
This would be fine, it just seems unreasonably complicated, not to
mention unnecessary. I might as well point this out [1]http://www.google.com/codesearch#search/&q=LINUX_OOM_ADJ=&type=cs - Slackware and Fedora are the only hits not in the PG codebase, and both define it to 0.. I don't think
a soul out there has built without defining this to 0 (if they define
it at all), and not even that many people are using it. Is it all that
bad of an idea to just force it to 0 for both knobs and be done with
it?
-Dan McGee
[1]: http://www.google.com/codesearch#search/&q=LINUX_OOM_ADJ=&type=cs - Slackware and Fedora are the only hits not in the PG codebase, and both define it to 0.
- Slackware and Fedora are the only hits not in the PG codebase, and
both define it to 0.
On Fri, Sep 23, 2011 at 4:05 PM, Dan McGee <dan@archlinux.org> wrote:
On Fri, Sep 23, 2011 at 2:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Sep 19, 2011 at 4:36 PM, Dan McGee <dan@archlinux.org> wrote:
[ patch ]
I suppose it's Tom who really needs to comment on this, but I'm not
too enthusiastic about this approach. Duplicating the Linux kernel
calculation into our code means that we could drift if the formula
changes again.Why would the formula ever change? This seems like a different excuse
way of really saying you don't appreciate the hacky approach, which I
can understand completely. However, it simply doesn't make sense for
them to change this formula, as it scales the -17 to 16 old range to
the new -1000 to 1000 range. Those endpoints won't be changing unless
a third method is introduced, in which case this whole thing is mute
and we'd need to fix it yet again.I like Tom's previous suggestion (I think) of allowing both constants
to be defined - if they are, then we try oom_score_adj first and fall
back to oom_adj if that fails. For bonus points, we could have
postmaster stat() its own oom_score_adj file before forking and set a
global variable to indicate the results. That way we'd only ever need
to test once per postmaster startup (at least until someone figures
out a way to swap out the running kernel without stopping the
server...!).This would be fine, it just seems unreasonably complicated, not to
mention unnecessary. I might as well point this out [1]. I don't think
a soul out there has built without defining this to 0 (if they define
it at all), and not even that many people are using it. Is it all that
bad of an idea to just force it to 0 for both knobs and be done with
it?
Did we do anything about this? Anyone else have an opinion on what
ought to be done?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
[ oom_score_adj business ]
Did we do anything about this? Anyone else have an opinion on what
ought to be done?
I held off doing anything because it didn't seem like we had consensus.
OTOH, it may well be that it's not important enough to demand real
consensus, and he who does the work gets to make the choices.
regards, tom lane
On Mon, Oct 24, 2011 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
[ oom_score_adj business ]
Did we do anything about this? Anyone else have an opinion on what
ought to be done?I held off doing anything because it didn't seem like we had consensus.
OTOH, it may well be that it's not important enough to demand real
consensus, and he who does the work gets to make the choices.
Half a loaf is better than none.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I was reminded today that we still haven't done anything about this:
Tom Lane <tgl@sss.pgh.pa.us> writes:
While testing 9.1 RPMs on Fedora 15 (2.6.40 kernel), I notice
messages like these in the kernel log:
Sep 11 13:38:56 rhl kernel: [ 415.308092] postgres (18040): /proc/18040/oom_adj is deprecated, please use /proc/18040/oom_score_adj instead.
At this point there are no shipping Fedora versions that don't emit this
gripe, and F15 is even about to go EOL.
The previous discussion thread at
http://archives.postgresql.org/pgsql-hackers/2011-09/msg00794.php
went off into the weeds of what was in my opinion over-design.
I still think it's sufficient to do what I suggested initially:
... The simplest, least risky change that I can think of is to
copy-and-paste the relevant #ifdef code block in fork_process.c.
If we do that, then it would be up to the packager whether to #define
LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior
he wants.
and would like to squeeze that into 9.2 so that we're only a year late
and not two years late in responding to this issue :-(.
Objections?
regards, tom lane
On Tue, Jun 12, 2012 at 12:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I was reminded today that we still haven't done anything about this:
Tom Lane <tgl@sss.pgh.pa.us> writes:
While testing 9.1 RPMs on Fedora 15 (2.6.40 kernel), I notice
messages like these in the kernel log:
Sep 11 13:38:56 rhl kernel: [ 415.308092] postgres (18040): /proc/18040/oom_adj is deprecated, please use /proc/18040/oom_score_adj instead.At this point there are no shipping Fedora versions that don't emit this
gripe, and F15 is even about to go EOL.The previous discussion thread at
http://archives.postgresql.org/pgsql-hackers/2011-09/msg00794.php
went off into the weeds of what was in my opinion over-design.
I still think it's sufficient to do what I suggested initially:... The simplest, least risky change that I can think of is to
copy-and-paste the relevant #ifdef code block in fork_process.c.
If we do that, then it would be up to the packager whether to #define
LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior
he wants.and would like to squeeze that into 9.2 so that we're only a year late
and not two years late in responding to this issue :-(.Objections?
I think my position, and the position of the people who responded to
the original thread, was that it seems like you're architecting a
kludge when it wouldn't be that hard to architect a nicer solution.
That having been said, I don't think it's such a large kludge that we
should all have massive dry heaves at the thought of it living in our
repository. And then too, this isn't the time to be architecting new
9.2 feature anyway. So I say go for it. If it makes your life
easier, back-patch it. Code that requires a #define to enable it
won't break anything for anyone else.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Jun 12, 2012 at 12:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I still think it's sufficient to do what I suggested initially:
... The simplest, least risky change that I can think of is to
copy-and-paste the relevant #ifdef code block in fork_process.c.
I think my position, and the position of the people who responded to
the original thread, was that it seems like you're architecting a
kludge when it wouldn't be that hard to architect a nicer solution.
I'd be the first to agree it's a kluge. But the end result of the
previous discussion was that it wasn't so obvious how to architect
a nicer solution. Nor is there all that much room for people to use a
nicer solution, given that we need help from not just fork_process.c
but the root-privileged startup script (or lately in Fedora it's a
systemd unit script doing the privilege-increasing end of this).
In the short run, if we don't have consensus on this, I'll probably
just carry a Fedora patch like so:
- int fd = open("/proc/self/oom_adj", O_WRONLY, 0);
+ int fd = open("/proc/self/oom_score_adj", O_WRONLY, 0);
But it seems a tad silly to be carrying such a patch for a block of
code that is only of interest to Linux packagers anyway, and nearly
all such packagers are facing this same issue either now or in the
very near future.
regards, tom lane