Kubernetes, cgroups v2 and OOM killer - how to avoid?

Started by Ancoron Luciferis · about 1 year ago · 7 messages · general
#1 Ancoron Luciferis
ancoron.luciferis@googlemail.com

Hi,

I've been investigating this topic every now and then but to this day
have not come to a setup that consistently leads to a PostgreSQL backend
process receiving an allocation error instead of being killed externally
by the OOM killer.

Why is this a problem for me? Because while applications are accessing
their DBs (multiple services having their own DBs, some high-frequency),
the whole server goes into recovery and kills all backends/connections.

While my applications are written to tolerate that, it also means that
at that time, esp. for the high-frequency apps, events are piling up,
which then leads to a burst as soon as connectivity is restored. This in
turn leads to peaks in resource usage in other places (event store,
in-memory buffers from apps, ...), which sometimes leads to a series of
OOM killer events being triggered, just because some analytics query
went overboard.

Ideally, I'd find a configuration that only terminates one backend but
leaves the others working.

I am wondering whether there is any way to receive a real ENOMEM inside
a cgroup as soon as I try to allocate beyond its memory.max, instead of
relying on the OOM killer.

I know the recommendation is to have vm.overcommit_memory set to 2, but
then that affects all workloads on the host, including critical infra
like the kubelet, CNI, CSI, monitoring, ...

I have already gone through and tested the obvious:

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

And yes, I know that Linux cgroups v2 memory.max is not an actual hard
limit:

https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files

Any help is greatly appreciated!

Cheers,

Ancoron

#2 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Ancoron Luciferis (#1)
Re: Kubernetes, cgroups v2 and OOM killer - how to avoid?

On Sat, 2025-04-05 at 13:53 +0200, Ancoron Luciferis wrote:

> I've been investigating this topic every now and then but to this day
> have not come to a setup that consistently leads to a PostgreSQL backend
> process receiving an allocation error instead of being killed externally
> by the OOM killer.
>
> Why is this a problem for me? Because while applications are accessing
> their DBs (multiple services having their own DBs, some high-frequency),
> the whole server goes into recovery and kills all backends/connections.

You don't have to explain why that is a problem. It clearly is!

> Ideally, I'd find a configuration that only terminates one backend but
> leaves the others working.

There isn't, but what you really want is:

> I am wondering whether there is any way to receive a real ENOMEM inside
> a cgroup as soon as I try to allocate beyond its memory.max, instead of
> relying on the OOM killer.
>
> I know the recommendation is to have vm.overcommit_memory set to 2, but
> then that affects all workloads on the host, including critical infra
> like the kubelet, CNI, CSI, monitoring, ...

I cannot answer your question, but I'd like to make two suggestions:

1. set the PostgreSQL configuration parameters "work_mem", "shared_buffers",
"maintenance_work_mem" and "max_connections" low enough that you don't
go out of memory. A crash is bad, but a query failing with an "out of
memory" error isn't nice either.

2. If you want to run PostgreSQL seriously in Kubernetes, put all PostgreSQL
pods on a dedicated host machine where you can disable memory overcommit.
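Laurenz's suggestion 1 can be sanity-checked with a back-of-the-envelope estimate. The sketch below (Python) uses an illustrative formula and made-up numbers, not official sizing guidance; note that work_mem is a per-sort/per-hash-node budget, so a single query can consume several multiples of it:

```python
# Pessimistic memory estimate for a PostgreSQL instance, to compare against a
# pod's memory limit. Illustrative simplification, not official guidance.

def rough_worst_case_bytes(shared_buffers, max_connections, work_mem,
                           maintenance_work_mem, nodes_per_query=3):
    """Shared memory, plus every backend running `nodes_per_query` concurrent
    work_mem-sized allocations, plus one maintenance operation."""
    return (shared_buffers
            + max_connections * nodes_per_query * work_mem
            + maintenance_work_mem)

MiB = 1024 * 1024
estimate = rough_worst_case_bytes(
    shared_buffers=512 * MiB,
    max_connections=100,
    work_mem=4 * MiB,
    maintenance_work_mem=64 * MiB,
)
print(estimate // MiB, "MiB")  # 512 + 100*3*4 + 64 = 1776 MiB
```

If that number is above the pod's memory limit, lowering work_mem or max_connections is the first lever to pull.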

Yours,
Laurenz Albe

#3 Joe Conway
mail@joeconway.com
In reply to: Ancoron Luciferis (#1)
Re: Kubernetes, cgroups v2 and OOM killer - how to avoid?

On 4/5/25 07:53, Ancoron Luciferis wrote:

> I've been investigating this topic every now and then but to this day
> have not come to a setup that consistently leads to a PostgreSQL backend
> process receiving an allocation error instead of being killed externally
> by the OOM killer.
>
> Why is this a problem for me? Because while applications are accessing
> their DBs (multiple services having their own DBs, some high-frequency),
> the whole server goes into recovery and kills all backends/connections.
>
> While my applications are written to tolerate that, it also means that
> at that time, esp. for the high-frequency apps, events are piling up,
> which then leads to a burst as soon as connectivity is restored. This in
> turn leads to peaks in resource usage in other places (event store,
> in-memory buffers from apps, ...), which sometimes leads to a series of
> OOM killer events being triggered, just because some analytics query
> went overboard.
>
> Ideally, I'd find a configuration that only terminates one backend but
> leaves the others working.
>
> I am wondering whether there is any way to receive a real ENOMEM inside
> a cgroup as soon as I try to allocate beyond its memory.max, instead of
> relying on the OOM killer.
>
> I know the recommendation is to have vm.overcommit_memory set to 2, but
> then that affects all workloads on the host, including critical infra
> like the kubelet, CNI, CSI, monitoring, ...
>
> I have already gone through and tested the obvious:
>
> https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

Importantly vm.overcommit_memory set to 2 only matters when memory is
constrained at the host level.

As soon as you are running in a cgroup with a hard memory limit,
vm.overcommit_memory is irrelevant.

You can have terabytes of free memory on the host, but if cgroup memory
usage exceeds memory.limit_in_bytes (cgroup v1) or memory.max (cgroup v2)
the OOM killer will pick the process in the cgroup with the highest
oom_score and whack it.

Unfortunately there is no equivalent to vm.overcommit_memory within the
cgroup.

> And yes, I know that Linux cgroups v2 memory.max is not an actual hard
> limit:
>
> https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files

Read that again -- memory.max *is* a hard limit (same as
memory.limit_in_bytes in cgroup v1).

"memory.max

A read-write single value file which exists on non-root cgroups. The
default is “max”.

Memory usage hard limit. This is the main mechanism to limit memory
usage of a cgroup. If a cgroup’s memory usage reaches this limit and
can’t be reduced, the OOM killer is invoked in the cgroup."

If you want a soft limit use memory.high.

"memory.high

A read-write single value file which exists on non-root cgroups. The
default is “max”.

Memory usage throttle limit. If a cgroup’s usage goes over the high
boundary, the processes of the cgroup are throttled and put under
heavy reclaim pressure.

Going over the high limit never invokes the OOM killer and under
extreme conditions the limit may be breached. The high limit should
be used in scenarios where an external process monitors the limited
cgroup to alleviate heavy reclaim pressure."

You want to be using memory.high rather than memory.max.

Also, I don't know what kubernetes recommends these days, but it used to
require you to disable swap. In more recent versions of kubernetes you
are able to run with swap enabled but I have no idea what the default is
-- make sure you run with swap enabled.

The combination of some swap being available, and the throttling under
heavy reclaim will likely mitigate your problems.
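As a sketch of that suggestion: setting a soft limit on a cgroup v2 subtree is just two file writes. A minimal Python version (assumptions: root or a delegated subtree, a unified hierarchy mounted at /sys/fs/cgroup; the cgroup name and helper are made up for illustration):

```python
# Set a cgroup v2 soft limit via memory.high instead of the hard memory.max,
# so the kernel throttles and reclaims rather than invoking the OOM killer.
from pathlib import Path

def set_soft_limit(cgroup: Path, high_bytes: int, max_bytes: str = "max"):
    # memory.high: reclaim/throttle threshold; never triggers the OOM killer.
    (cgroup / "memory.high").write_text(f"{high_bytes}\n")
    # memory.max left at "max" so exceeding memory.high only throttles.
    (cgroup / "memory.max").write_text(f"{max_bytes}\n")

# Usage (as root, path is illustrative):
# cg = Path("/sys/fs/cgroup/postgres-demo")
# cg.mkdir(exist_ok=True)
# set_soft_limit(cg, 2 * 1024**3)              # throttle above 2 GiB
# (cg / "cgroup.procs").write_text(str(postmaster_pid))
```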

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#4 Ancoron Luciferis
ancoron.luciferis@googlemail.com
In reply to: Joe Conway (#3)
Re: Kubernetes, cgroups v2 and OOM killer - how to avoid?

On 2025-04-07 15:21, Joe Conway wrote:

> On 4/5/25 07:53, Ancoron Luciferis wrote:
>
>> I've been investigating this topic every now and then but to this day
>> have not come to a setup that consistently leads to a PostgreSQL backend
>> process receiving an allocation error instead of being killed externally
>> by the OOM killer.
>>
>> Why is this a problem for me? Because while applications are accessing
>> their DBs (multiple services having their own DBs, some high-frequency),
>> the whole server goes into recovery and kills all backends/connections.
>>
>> While my applications are written to tolerate that, it also means that
>> at that time, esp. for the high-frequency apps, events are piling up,
>> which then leads to a burst as soon as connectivity is restored. This in
>> turn leads to peaks in resource usage in other places (event store,
>> in-memory buffers from apps, ...), which sometimes leads to a series of
>> OOM killer events being triggered, just because some analytics query
>> went overboard.
>>
>> Ideally, I'd find a configuration that only terminates one backend but
>> leaves the others working.
>>
>> I am wondering whether there is any way to receive a real ENOMEM inside
>> a cgroup as soon as I try to allocate beyond its memory.max, instead of
>> relying on the OOM killer.
>>
>> I know the recommendation is to have vm.overcommit_memory set to 2, but
>> then that affects all workloads on the host, including critical infra
>> like the kubelet, CNI, CSI, monitoring, ...
>>
>> I have already gone through and tested the obvious:
>>
>> https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
>
> Importantly vm.overcommit_memory set to 2 only matters when memory is
> constrained at the host level.
>
> As soon as you are running in a cgroup with a hard memory limit,
> vm.overcommit_memory is irrelevant.
>
> You can have terabytes of free memory on the host, but if cgroup memory
> usage exceeds memory.limit_in_bytes (cgroup v1) or memory.max (cgroup v2)
> the OOM killer will pick the process in the cgroup with the highest
> oom_score and whack it.
>
> Unfortunately there is no equivalent to vm.overcommit_memory within the
> cgroup.
>
>> And yes, I know that Linux cgroups v2 memory.max is not an actual hard
>> limit:
>>
>> https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files
>
> Read that again -- memory.max *is* a hard limit (same as
> memory.limit_in_bytes in cgroup v1).
>
>   "memory.max
>
>     A read-write single value file which exists on non-root cgroups. The
>     default is “max”.
>
>     Memory usage hard limit. This is the main mechanism to limit memory
>     usage of a cgroup. If a cgroup’s memory usage reaches this limit and
>     can’t be reduced, the OOM killer is invoked in the cgroup."

Yes, I know it says "hard limit", but any app can still go beyond it
(it might just be on me to assume that a "hard limit" implies an actual
error when trying to exceed it). The OOM killer will kick in eventually,
but not in any way that any process inside the cgroup could prevent. So
there is no signal the app could react to, saying "hey, you just went
beyond what you're allowed, please adjust before I kill you".
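Lacking an ENOMEM signal, one workaround (my own assumption, not anything PostgreSQL or the kernel provides out of the box) is a sidecar that polls the cgroup's memory.current and memory.events counters as an early warning before memory.max is hit. A minimal sketch; paths and the threshold are illustrative:

```python
# The cgroup can't deliver ENOMEM to the app, but a sidecar can poll the
# cgroup's memory files and shed load (or alert) before memory.max is hit.
from pathlib import Path

def parse_memory_events(text: str) -> dict:
    """Parse cgroup v2 memory.events lines such as 'high 3' into a dict."""
    return {key: int(val) for key, val in
            (line.split() for line in text.splitlines() if line)}

def usage_ratio(cgroup: Path) -> float:
    """Fraction of memory.max currently used; 0.0 when no limit is set."""
    current = int((cgroup / "memory.current").read_text())
    max_raw = (cgroup / "memory.max").read_text().strip()
    return 0.0 if max_raw == "max" else current / int(max_raw)

# A monitor loop would warn when usage_ratio(cg) > 0.9, or when the 'high'
# counter in parse_memory_events((cg / "memory.events").read_text())
# increases between polls.
```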

> If you want a soft limit use memory.high.
>
>   "memory.high
>
>     A read-write single value file which exists on non-root cgroups. The
>     default is “max”.
>
>     Memory usage throttle limit. If a cgroup’s usage goes over the high
>     boundary, the processes of the cgroup are throttled and put under
>     heavy reclaim pressure.
>
>     Going over the high limit never invokes the OOM killer and under
>     extreme conditions the limit may be breached. The high limit should
>     be used in scenarios where an external process monitors the limited
>     cgroup to alleviate heavy reclaim pressure."
>
> You want to be using memory.high rather than memory.max.

Hm, so solely relying on reclaim? I think that'll just put the whole
cgroup into ultra-slow mode and would not actually prevent too much
memory allocation. While this may work out just fine for the PostgreSQL
instance, it'll for sure have effects on the other workloads on the same
node (which I apparently have: more PG instances).

I also don't see a way to even try this out in a Kubernetes environment,
since there doesn't seem to be a way to set this field through any
workload manifest field.

> Also, I don't know what kubernetes recommends these days, but it used to
> require you to disable swap. In more recent versions of kubernetes you
> are able to run with swap enabled but I have no idea what the default is
> -- make sure you run with swap enabled.

Yes, this is what I wanna try out next.

> The combination of some swap being available, and the throttling under
> heavy reclaim will likely mitigate your problems.

Thank you for your insights, I have something to think about.

Cheers,

Ancoron

#5 Joe Conway
mail@joeconway.com
In reply to: Ancoron Luciferis (#4)
Re: Kubernetes, cgroups v2 and OOM killer - how to avoid?

On 4/8/25 13:58, Ancoron Luciferis wrote:

> On 2025-04-07 15:21, Joe Conway wrote:
>
>> On 4/5/25 07:53, Ancoron Luciferis wrote:
>>
>>> I've been investigating this topic every now and then but to this day
>>> have not come to a setup that consistently leads to a PostgreSQL backend
>>> process receiving an allocation error instead of being killed externally
>>> by the OOM killer.
>>>
>>> Why is this a problem for me? Because while applications are accessing
>>> their DBs (multiple services having their own DBs, some high-frequency),
>>> the whole server goes into recovery and kills all backends/connections.
>>>
>>> While my applications are written to tolerate that, it also means that
>>> at that time, esp. for the high-frequency apps, events are piling up,
>>> which then leads to a burst as soon as connectivity is restored. This in
>>> turn leads to peaks in resource usage in other places (event store,
>>> in-memory buffers from apps, ...), which sometimes leads to a series of
>>> OOM killer events being triggered, just because some analytics query
>>> went overboard.
>>>
>>> Ideally, I'd find a configuration that only terminates one backend but
>>> leaves the others working.
>>>
>>> I am wondering whether there is any way to receive a real ENOMEM inside
>>> a cgroup as soon as I try to allocate beyond its memory.max, instead of
>>> relying on the OOM killer.
>>>
>>> I know the recommendation is to have vm.overcommit_memory set to 2, but
>>> then that affects all workloads on the host, including critical infra
>>> like the kubelet, CNI, CSI, monitoring, ...
>>>
>>> I have already gone through and tested the obvious:
>>>
>>> https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
>>
>> Importantly vm.overcommit_memory set to 2 only matters when memory is
>> constrained at the host level.
>>
>> As soon as you are running in a cgroup with a hard memory limit,
>> vm.overcommit_memory is irrelevant.
>>
>> You can have terabytes of free memory on the host, but if cgroup memory
>> usage exceeds memory.limit_in_bytes (cgroup v1) or memory.max (cgroup v2)
>> the OOM killer will pick the process in the cgroup with the highest
>> oom_score and whack it.
>>
>> Unfortunately there is no equivalent to vm.overcommit_memory within the
>> cgroup.
>>
>>> And yes, I know that Linux cgroups v2 memory.max is not an actual hard
>>> limit:
>>>
>>> https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files
>>
>> Read that again -- memory.max *is* a hard limit (same as
>> memory.limit_in_bytes in cgroup v1).
>>
>>   "memory.max
>>
>>     A read-write single value file which exists on non-root cgroups. The
>>     default is “max”.
>>
>>     Memory usage hard limit. This is the main mechanism to limit memory
>>     usage of a cgroup. If a cgroup’s memory usage reaches this limit and
>>     can’t be reduced, the OOM killer is invoked in the cgroup."
>
> Yes, I know it says "hard limit", but any app can still go beyond it
> (it might just be on me to assume that a "hard limit" implies an actual
> error when trying to exceed it). The OOM killer will kick in eventually,
> but not in any way that any process inside the cgroup could prevent. So
> there is no signal the app could react to, saying "hey, you just went
> beyond what you're allowed, please adjust before I kill you".

No, that really is a hard limit and the OOM killer is *really* fast.
Once that is hit there is no time to intervene. The soft limit
(memory.high) is the one you want for that.

Or you can monitor PSI and try to anticipate problems, but that is
difficult at best. If you want to see how that is done, check out
senpai: https://github.com/facebookincubator/senpai/blob/main/README.md
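For reference, the PSI signal senpai consumes is a cgroup's memory.pressure file, which is easy to parse. A small sketch (field semantics per the kernel PSI docs; the alert threshold is illustrative):

```python
# memory.pressure (PSI) sample: "some" = share of wall time at least one task
# was stalled on memory; "full" = all non-idle tasks stalled. avg10/avg60/
# avg300 are rolling percentages, total is cumulative stall time (usecs).
def parse_psi(text: str) -> dict:
    out = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) if "." in v else int(v)
                     for k, v in (f.split("=") for f in fields)}
    return out

sample = ("some avg10=1.25 avg60=0.80 avg300=0.10 total=123456\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=0")
psi = parse_psi(sample)
# e.g. start trimming work when psi["some"]["avg10"] rises above ~1.0
```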

>> If you want a soft limit use memory.high.
>>
>>   "memory.high
>>
>>     A read-write single value file which exists on non-root cgroups. The
>>     default is “max”.
>>
>>     Memory usage throttle limit. If a cgroup’s usage goes over the high
>>     boundary, the processes of the cgroup are throttled and put under
>>     heavy reclaim pressure.
>>
>>     Going over the high limit never invokes the OOM killer and under
>>     extreme conditions the limit may be breached. The high limit should
>>     be used in scenarios where an external process monitors the limited
>>     cgroup to alleviate heavy reclaim pressure."
>>
>> You want to be using memory.high rather than memory.max.
>
> Hm, so solely relying on reclaim? I think that'll just put the whole
> cgroup into ultra-slow mode and would not actually prevent too much
> memory allocation. While this may work out just fine for the PostgreSQL
> instance, it'll for sure have effects on the other workloads on the same
> node (which I apparently have: more PG instances).
>
> I also don't see a way to even try this out in a Kubernetes environment,
> since there doesn't seem to be a way to set this field through any
> workload manifest field.

Yeah, that part I have no idea about. I quit looking at kubernetes
related things about 3 years ago. Although, this link seems to indicate
there is a way related to how it does QoS:
https://kubernetes.io/blog/2023/05/05/qos-memory-resources/#:~:text=memory.high%20formula
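If I read the linked post correctly, the kubelet's MemoryQoS alpha feature derives memory.high from the container's request and limit roughly as follows. This is a sketch of the formula as described there; the 0.9 default factor and the page rounding are my reading of the blog/KEP, so verify against current docs:

```python
# Sketch of the Kubernetes MemoryQoS memory.high formula: place memory.high
# between the container's request and limit using a throttling factor,
# rounded down to a page boundary. Defaults here are assumptions.
def k8s_memory_high(request: int, limit: int,
                    throttling_factor: float = 0.9,
                    page_size: int = 4096) -> int:
    raw = request + throttling_factor * (limit - request)
    return int(raw // page_size) * page_size

GiB = 1024 ** 3
print(k8s_memory_high(request=1 * GiB, limit=2 * GiB))  # a bit under 1.9 GiB
```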

>> Also, I don't know what kubernetes recommends these days, but it used to
>> require you to disable swap. In more recent versions of kubernetes you
>> are able to run with swap enabled but I have no idea what the default is
>> -- make sure you run with swap enabled.
>
> Yes, this is what I wanna try out next.

Seriously -- this is *way* more than half the battle. If you do nothing
else, be sure to do this...

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#6 Dirschel, Steve
steve.dirschel@thomsonreuters.com
In reply to: Joe Conway (#5)
Postgres_fdw- User Mapping with md5-hashed password

I know I can create user steve_test with password testpassword122 as md5 by doing:

select 'md5'||md5('testpassword122steve_test');
Returns --> md5eb7e220574bf85096ee99370ad67cbd3

CREATE USER steve_test WITH PASSWORD 'md5eb7e220574bf85096ee99370ad67cbd3';

And then I can login as steve_test with password testpassword122.

I'm trying to use similar logic when creating a user mapping:

CREATE USER MAPPING FOR postgres SERVER steve_snap0 OPTIONS (user 'steve_test', password 'md5eb7e220574bf85096ee99370ad67cbd3');

When I try and import a foreign schema I get an error:

ERROR: could not connect to server "steve_snap0"

If I create the user mapping with the password:

CREATE USER MAPPING FOR postgres SERVER steve_snap0 OPTIONS (user 'steve_test', password 'testpassword122');

It works fine.

Is it not possible to use the same logic for the user mapping password that can be used when creating a user?

Thanks in advance.
This e-mail is for the sole use of the intended recipient and contains information that may be privileged and/or confidential. If you are not an intended recipient, please notify the sender by return e-mail and delete this e-mail and any attachments. Certain required legal entity disclosures can be accessed on our website: https://www.thomsonreuters.com/en/resources/disclosures.html

#7 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dirschel, Steve (#6)
Re: Postgres_fdw- User Mapping with md5-hashed password

On 4/8/25 13:00, Dirschel, Steve wrote:

> I know I can create user steve_test with password testpassword122 as md5 by doing:
>
> select 'md5'||md5('testpassword122steve_test');
> Returns --> md5eb7e220574bf85096ee99370ad67cbd3
>
> CREATE USER steve_test WITH PASSWORD 'md5eb7e220574bf85096ee99370ad67cbd3';
>
> And then I can login as steve_test with password testpassword122.
>
> I'm trying to use similar logic when creating a user mapping:
>
> CREATE USER MAPPING FOR postgres SERVER steve_snap0 OPTIONS (user 'steve_test', password 'md5eb7e220574bf85096ee99370ad67cbd3');
>
> When I try and import a foreign schema I get an error:
>
> ERROR: could not connect to server "steve_snap0"
>
> If I create the user mapping with the password:
>
> CREATE USER MAPPING FOR postgres SERVER steve_snap0 OPTIONS (user 'steve_test', password 'testpassword122');
>
> It works fine.
>
> Is it not possible to use the same logic for the user mapping password that can be used when creating a user?

A) Short version

No, you can't.

B) Long version

From here:

CREATE ROLE

https://www.postgresql.org/docs/current/sql-createrole.html

"If the presented password string is already in MD5-encrypted or
SCRAM-encrypted format, then it is stored as-is regardless of
password_encryption (since the system cannot decrypt the specified
encrypted password string, to encrypt it in a different format). This
allows reloading of encrypted passwords during dump/restore."

Whereas from here:

https://www.postgresql.org/docs/current/postgres-fdw.html

"A user mapping, defined with CREATE USER MAPPING, is needed as well to
identify the role that will be used on the remote server:

CREATE USER MAPPING FOR local_user
SERVER foreign_server
OPTIONS (user 'foreign_user', password 'password');
"

In the above you are just supplying values to the connection string,
not actually creating a password as in the first case.
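Adrian's distinction can be illustrated in a few lines (Python, hashlib only; the helper name is mine):

```python
# CREATE ROLE accepts a pre-hashed password literal, 'md5' + md5(password ||
# username), but the user mapping's 'password' option is handed to libpq as
# the cleartext password, so a hash does not work there.
import hashlib

def md5_role_password(username: str, cleartext: str) -> str:
    """Build the literal PostgreSQL stores for an MD5-hashed role password."""
    return "md5" + hashlib.md5((cleartext + username).encode()).hexdigest()

literal = md5_role_password("steve_test", "testpassword122")
# Works:        CREATE USER steve_test WITH PASSWORD '<literal>';
# Doesn't work: CREATE USER MAPPING ... OPTIONS (user 'steve_test',
#               password '<literal>');  -- libpq needs the real password
```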


--
Adrian Klaver
adrian.klaver@aklaver.com