Linux OOM killer

Started by Ariel Tejera over 1 year ago | 6 messages | bugs
#1 Ariel Tejera
artejera@gmail.com

Hi. I hope this message finds you well.

The issue is that one of our Postgres servers hit a bug and was killed by
the Linux OOM killer, as shown in the two events below:

[image: image.png]

We were able to fix this problem by adjusting the server configuration with:
enable_memoize = off

Postgres version: 14.5
Linux: AWS Linux 2 (with diverse concurrent workloads)
RAM: 32 GB
Database size: 200 GB

The issue was internally documented in this link:

Postgres failure 2024-09-20
<https://drive.google.com/open?id=1FkHAqVkPmC_jT6ugYEsGX0ZrzXMMLSMarf_8V02E2qg>

This is the first reproducible bug I've found in 20 years of heavy Postgres
use (!)

As this bug is associated with large databases, it is impractical to offer
a reproducible example for it. We hope, however, that this report will be
of some use for the Postgres project.

Yours,
Ariel Tejera Molina
Technical Director
ITComplements SA de CV

Attachments:

image.png (image/png)

#2 David G. Johnston
david.g.johnston@gmail.com
In reply to: Ariel Tejera (#1)
Re: Linux OOM killer

On Tue, Oct 1, 2024 at 11:44 AM Ariel Tejera <artejera@gmail.com> wrote:

We were able to fix this problem adjusting the server configuration with:
enable_memoize = off

Our Postgres version is 14.5

Which is over two years out-of-date. As this is apparently reproducible
I'm sure someone here will try, but reporting bugs against out-of-support
versions is not the best idea. Better to confirm they still exist on
current versions yourself before reporting.

The issue was internally documented in this link:

Postgres failure 2024-09-20
<https://drive.google.com/open?id=1FkHAqVkPmC_jT6ugYEsGX0ZrzXMMLSMarf_8V02E2qg>

Please send any such material inline with the email so it survives in the
archive.

David J.

#3 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Ariel Tejera (#1)
Re: Linux OOM killer

On Tue, 2024-10-01 at 12:17 -0600, Ariel Tejera wrote:

Hi.  I hope this message finds you well.

The issue is that one of our Postgres servers hit a bug and was killed by linux OOM, as
shown in the lines below, showing two events:

We were able to fix this problem adjusting the server configuration with:
enable_memoize = off

Our Postgres version is 14.5 
Linux AWS linux2 (with diverse concurrent workloads)
Ram 32GB
Database size 200 GB 

This is the first reproducible bug I've found in 20 years using postgres, heavily (!)

As this bug is associated with large databases, it is impractical to offer a reproducible example for it.
We hope, however, that this report will be of some use for the Postgres project.

First of all, update to 14.latest. I find at least one bug fixed in this area:
https://postgr.es/c/e4b95b9b02, discussed in /messages/by-id/83281eed63c74e4f940317186372abfd@cft.ru

Then, disable memory overcommit, so that you don't get killed by the OOM killer.
Then you will get an "out of memory" error and a memory context dump in the log.
We'd need to see that to figure out if it really is a bug.

Running out of memory need not be a bug. It may just as well be that you
configured PostgreSQL too generously.

Yours,
Laurenz Albe
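
The overcommit advice in the message above can be made concrete with a little arithmetic. This is only a rough sketch, assuming strict overcommit mode (vm.overcommit_memory = 2), the kernel default vm.overcommit_ratio of 50, no swap, and no huge pages; the swap size and the alternative ratios are illustrative assumptions, not values from the report. The helper name commit_limit_kib is hypothetical.

```python
from __future__ import annotations

GIB = 1024 * 1024  # KiB per GiB


def commit_limit_kib(ram_kib: int, swap_kib: int, overcommit_ratio: int = 50) -> int:
    """Approximate the kernel's CommitLimit in strict overcommit mode.

    With vm.overcommit_memory = 2 the kernel refuses allocations beyond
    roughly: swap + RAM * vm.overcommit_ratio / 100.
    Huge pages are ignored here to keep the sketch simple.
    """
    return swap_kib + ram_kib * overcommit_ratio // 100


# The reported server has 32 GB of RAM; swap = 0 is an assumption.
ram = 32 * GIB
for ratio in (50, 80, 100):
    limit = commit_limit_kib(ram, swap_kib=0, overcommit_ratio=ratio)
    print(f"vm.overcommit_ratio={ratio}: CommitLimit is about {limit / GIB:.1f} GiB")
```

Under these assumptions, allocations past the limit fail with an ordinary "out of memory" error inside the backend instead of triggering the OOM killer, which is what produces the memory context dump requested above.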

#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Laurenz Albe (#3)
Re: Linux OOM killer

On 10/2/24 06:16, Laurenz Albe wrote:

First of all, update to 14.latest. I find at least one bug fixed in this area:
https://postgr.es/c/e4b95b9b02, discussed in /messages/by-id/83281eed63c74e4f940317186372abfd@cft.ru

Then, disable memory overcommit, so that you don't get killed by the OOM killer.
Then you will get an "out of memory" error and a memory context dump in the log.
We'd need to see that to figure out if it really is a bug.

FWIW I don't think anyone can investigate this without more information.
In particular, we'd need the query plan triggering the issue, with info
about the schema (which data types, ...) and data sizes. And the memory
context information - either logged during OOM, or collected using gdb.

But yeah, definitely update to newest 14.x first. Chances are this is
already fixed.

regards

--
Tomas Vondra
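
For the gdb route mentioned above, a commonly used approach is to attach to the suspect backend and call MemoryContextStats(TopMemoryContext), which writes the context tree to that backend's stderr (normally the server log). The sketch below only builds the command line; it assumes gdb and PostgreSQL debug symbols are installed, and the PID 12345 and the helper name gdb_context_dump_command are hypothetical.

```python
from __future__ import annotations

import shlex


def gdb_context_dump_command(backend_pid: int) -> list[str]:
    """Build a gdb invocation that dumps a backend's memory contexts.

    The dump is printed by the backend itself, so it appears in the
    server log rather than in gdb's own output.
    """
    if backend_pid <= 0:
        raise ValueError("backend_pid must be a positive process id")
    return [
        "gdb",
        "--batch",                 # run the commands and exit
        "-p", str(backend_pid),    # attach to the running backend
        "-ex", "call MemoryContextStats(TopMemoryContext)",
    ]


# Hypothetical PID; the real one would come from pg_stat_activity or ps.
print(shlex.join(gdb_context_dump_command(12345)))
```

On PostgreSQL 14 and later, running SELECT pg_log_backend_memory_contexts(pid) as a superuser should give a similar dump without attaching a debugger.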

#5 Ariel Tejera
artejera@gmail.com
In reply to: Tomas Vondra (#4)
Re: Linux OOM killer

Hi,

Right, I'll try to upgrade and then retry, as you recommend. Unfortunately,
we're short of hands at the moment.
For us, the issue is in practice solved with enable_memoize = off.

Yours,
Ariel Tejera

#6 David Rowley
dgrowleyml@gmail.com
In reply to: Ariel Tejera (#5)
Re: Linux OOM killer

On Thu, 3 Oct 2024 at 07:16, Ariel Tejera <artejera@gmail.com> wrote:

Right .. I'll try to upgrade versions and then retry, as you recommend, unfortunately we're short of hands at the moment.
For us the issue is in practice solved with memoizing=off

Upgrading minor versions is a trivial task, and it's one you should
give much higher priority to than you have been.

For the reported memory leak, if you look at the release notes for PG
14.8 [1], you'll see:

"Fix memory leak in Memoize plan execution (David Rowley)"

It's quite likely this will fix the issue you reported. If it doesn't,
please feel free to update us and we can look further. Unfortunately,
we've no means to time travel back to fix these bugs in the past, so
as a workaround, we release minor versions to fix discovered bugs.
It's a good idea to upgrade to the latest minor versions for your
given release shortly after these are released. That's approximately
every 3 months.

There's more information about the project's versioning policy in [2].

David

[1]: https://www.postgresql.org/docs/release/14.8/
[2]: https://www.postgresql.org/support/versioning/
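
As a small illustration of the minor-version point, the sketch below checks whether a 14.x server is at or past 14.8, the release whose notes list the Memoize leak fix quoted above. It deliberately handles only major version 14, since nothing in this thread says which minor releases of other branches carry the fix; the function names are hypothetical.

```python
from __future__ import annotations

# From the 14.8 release notes quoted above:
# "Fix memory leak in Memoize plan execution".
MEMOIZE_LEAK_FIX = (14, 8)


def parse_pg_version(text: str) -> tuple[int, int]:
    """Parse a 'major.minor' string such as '14.5' (as shown by SHOW server_version)."""
    # Keep only the leading version token, e.g. '14.5 (Ubuntu ...)' -> '14.5'.
    token = text.strip().split()[0]
    major, minor = token.split(".")[:2]
    return int(major), int(minor)


def has_memoize_leak_fix(version: str) -> bool:
    """Return True if a 14.x server already includes the 14.8 Memoize fix."""
    parsed = parse_pg_version(version)
    if parsed[0] != MEMOIZE_LEAK_FIX[0]:
        raise ValueError("this sketch only covers the PostgreSQL 14 branch")
    return parsed >= MEMOIZE_LEAK_FIX


print(has_memoize_leak_fix("14.5"))  # the version from the original report
```

Tuples compare element by element, so (14, 13) correctly sorts after (14, 8), which a plain string comparison of "14.13" and "14.8" would get wrong.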