Failing to allocate memory when I think it shouldn't

Started by Thomas Ziegler · over 1 year ago · 6 messages · general
#1 Thomas Ziegler
thomas.ziegler@holmsecurity.com

Hello Community!

I have had my database killed by the kernel oom-killer. After that I turned off memory over-committing, and that is where things got weird.

I have a PostgreSQL 16.2 running on a Debian Bookworm. The machine has 64GB of RAM and runs in a VM in an OpenStack environment.
I have `shared_buffers` at `16000MB`, `work_mem` at `80MB`, `temp_buffers` at `8MB`, `max_connections` at `300` and `maintenance_work_mem` at `1GB`.
So all in all, I get to roughly 42GB of max memory usage (`16000+(80+8)*300=42400`).

But now I have connections failing because they can't allocate enough memory. I thought they would start creating temporary files in that case (and I tested this with trivial examples in psql), but the connections from my applications sometimes fail anyway. I suspect that the queries are so large that the execution plans etc. don't fit into the 80MB of work_mem. This is the "simpler" of my two problems, but it would be nice if somebody could tell me whether my suspicion makes sense.

The harder issue is that in multiple instances those memory issues actually led to the database "crashing" and having to recover from the redo log. To be precise, the master process doesn't crash, and the database automatically starts recovery (in contrast to the oom-kill, where I had to restart the database manually).
I can't get to the bottom of why that happens. At the time it happened, there was more than enough usable memory on the system (>17GB), and I don't understand how a single worker process failing to get more RAM can take down the whole database. Also, why wasn't it able to get more RAM in the first place? The kernel should have been more than happy to hand it out, given how much usable RAM was available. (By "usable" I mean free + cache - shared.)
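For concreteness, the "usable" figure as I define it can be computed from /proc/meminfo fields like this (a sketch; the field names are the standard kernel ones, the sample values are made up to roughly resemble my machine):

```python
def usable_memory_kib(meminfo: dict) -> int:
    """'Usable' as defined above: free + page cache - shared memory (all KiB)."""
    return meminfo["MemFree"] + meminfo["Cached"] - meminfo["Shmem"]

# Made-up sample values in KiB, roughly resembling the 64GB machine above.
sample = {
    "MemFree": 4_000_000,   # ~4 GiB truly free
    "Cached": 30_000_000,   # ~30 GiB page cache
    "Shmem": 16_500_000,    # shared_buffers and other shared memory live here
}
print(usable_memory_kib(sample) // 1024, "MiB usable")
```

Note that with overcommit turned off (overcommit_memory=2) the kernel checks Committed_AS against CommitLimit, not this figure, so allocations can be denied even while "usable" memory looks plentiful.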

Here are logs of one of those instances:

TopMemoryContext: 166384 total in 7 blocks; 33128 free (10 chunks); 133256 used
smgr relation table: 16384 total in 2 blocks; 4640 free (2 chunks); 11744 used
PgStat Shared Ref Hash: 7216 total in 2 blocks; 688 free (0 chunks); 6528 used
PgStat Shared Ref: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
PgStat Pending: 4096 total in 3 blocks; 2168 free (6 chunks); 1928 used
TopTransactionContext: 8192 total in 1 blocks; 4536 free (9 chunks); 3656 used
TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); 264 used
Portal hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used
TopPortalContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
Relcache by OID: 16384 total in 2 blocks; 7712 free (2 chunks); 8672 used
CacheMemoryContext: 262144 total in 6 blocks; 160 free (2 chunks); 261984 used
index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_rewrite_rel_rulename_index
[ ... ]
WAL record construction: 50200 total in 2 blocks; 6376 free (0 chunks); 43824 used
PrivateRefCount: 8192 total in 1 blocks; 2648 free (0 chunks); 5544 used
MdSmgr: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
LOCALLOCK hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used
GUCMemoryContext: 24576 total in 2 blocks; 5288 free (5 chunks); 19288 used
GUC hash table: 32768 total in 3 blocks; 12704 free (5 chunks); 20064 used
Timezones: 104112 total in 2 blocks; 2648 free (0 chunks); 101464 used
Postmaster: 8192 total in 1 blocks; 6496 free (10 chunks); 1696 used
ident parser context: 1024 total in 1 blocks; 760 free (0 chunks); 264 used
hba parser context: 17408 total in 5 blocks; 6944 free (7 chunks); 10464 used
ErrorContext: 8192 total in 1 blocks; 7928 free (6 chunks); 264 used
Grand total: 889064 bytes in 124 blocks; 180840 free (113 chunks); 708224 used
2024-09-12 05:18:36.064 UTC [3808080] _@_ FATAL: out of memory
2024-09-12 05:18:36.064 UTC [3808080] _@_ DETAIL: Failed on request of size 190 in memory context "CacheMemoryContext".
TopMemoryContext: 166384 total in 7 blocks; 33128 free (10 chunks); 133256 used
[ ... ]
2024-09-12 05:18:36.065 UTC [3808075] _@_ FATAL: out of memory
2024-09-12 05:18:36.065 UTC [3808075] _@_ DETAIL: Failed on request of size 190 in memory context "CacheMemoryContext".
2024-09-12 05:18:36.065 UTC [3808076] ERROR: out of memory
2024-09-12 05:18:36.065 UTC [3808076] DETAIL: Failed on request of size 32800 in memory context "HashBatchContext".
2024-09-12 05:18:36.065 UTC [3805724] _@_ ERROR: out of memory
2024-09-12 05:18:36.065 UTC [3805724] _@_ DETAIL: Failed on request of size 32800 in memory context "HashBatchContext".
2024-09-12 05:18:36.065 UTC [3805724] _@_ CONTEXT: parallel worker
2024-09-12 05:18:36.065 UTC [3808078] _@_ FATAL: out of memory
2024-09-12 05:18:36.065 UTC [3808078] _@_ DETAIL: Failed on request of size 190 in memory context "CacheMemoryContext".
2024-09-12 05:18:36.065 UTC [3808081] _@_ FATAL: out of memory
2024-09-12 05:18:36.065 UTC [3808081] _@_ DETAIL: Failed on request of size 40992 in memory context "TopMemoryContext".
2024-09-12 05:18:36.066 UTC [1932776] LOG: could not fork new process for connection: Cannot allocate memory
2024-09-12 05:18:36.066 UTC [3808050] _@_ ERROR: out of memory
2024-09-12 05:18:36.066 UTC [3808050] _@_ DETAIL: Failed while creating memory context "HashAgg meta context".
2024-09-12 05:18:36.067 UTC [3488999] _@_ ERROR: out of memory
2024-09-12 05:18:36.067 UTC [3488999] _@_ DETAIL: Failed on request of size 32800 in memory context "HashBatchContext".
2024-09-12 05:18:36.073 UTC [1932776] LOG: background worker "parallel worker" (PID 3808076) exited with exit code 1
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
2024-09-12 05:18:36.083 UTC [1932776] LOG: background worker "parallel worker" (PID 3808077) was terminated by signal 6: Aborted
2024-09-12 05:18:36.083 UTC [1932776] LOG: terminating any other active server processes
2024-09-12 05:18:36.829 UTC [1932776] LOG: all server processes terminated; reinitializing
2024-09-12 05:18:39.731 UTC [3808861] LOG: database system was interrupted; last known up at 2024-09-12 05:08:52 UTC
2024-09-12 05:18:40.277 UTC [3808861] LOG: database system was not properly shut down; automatic recovery in progress
2024-09-12 05:18:40.303 UTC [3808861] LOG: redo starts at 3EA1/3AB0CF70
2024-09-12 05:18:50.304 UTC [3808861] LOG: redo in progress, elapsed time: 10.00 s, current LSN: 3EA1/837DA848
2024-09-12 05:18:55.612 UTC [3808861] LOG: invalid record length at 3EA1/C42A4A48: expected at least 24, got 0
2024-09-12 05:18:55.612 UTC [3808861] LOG: redo done at 3EA1/C42A4A20 system usage: CPU: user: 9.04 s, system: 3.49 s, elapsed: 15.31 s
2024-09-12 05:18:55.642 UTC [3808862] LOG: checkpoint starting: end-of-recovery immediate wait
2024-09-12 05:19:16.215 UTC [3808862] LOG: checkpoint complete: wrote 294597 buffers (14.4%); 0 WAL file(s) added, 0 removed, 105 recycled; [ ... ]
2024-09-12 05:19:16.340 UTC [1932776] LOG: database system is ready to accept connections

I would be eternally grateful if somebody could explain to me what is going on there. As I wrote earlier, there should have been more than enough RAM available on the system, so I don't see any reason why the kernel would deny any memory allocations.
The only related item I was able to find is this thread from 2018: /messages/by-id/e47eec7f-9dbc-cbe3-dd01-32cbee90f332@neteven.com

Cheers,
Thomas

#2 Christoph Moench-Tegeder
cmt@burggraben.net
In reply to: Thomas Ziegler (#1)
Re: Failing to allocate memory when I think it shouldn't

Hi,

## Thomas Ziegler (thomas.ziegler@holmsecurity.com):

There's a lot of information missing here. Let's start from the top.

I have had my database killed by the kernel oom-killer. After that I
turned off memory over-committing and that is where things got weird.

What exactly did you set? When playing with vm.overcommit, did you
understand "Committed Address Space" and the workings of the
overcommit accounting? This is the document:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/mm/overcommit-accounting.rst
Hint: when setting overcommit_memory=2 you might end up with way
less available address space than you thought you would. Also keep
an eye on /proc/meminfo - it's sometimes hard to estimate off the
cuff what's in memory and how it's mapped. (Also, is there anything
else on that machine which might hog memory?)
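To illustrate Christoph's hint: under vm.overcommit_memory=2 the kernel caps committed address space at roughly CommitLimit = swap + RAM * overcommit_ratio / 100 (vm.overcommit_kbytes, when set, overrides the ratio). A sketch with numbers matching the 64GB machine above:

```python
def commit_limit_gib(ram_gib: float, swap_gib: float, overcommit_ratio: int) -> float:
    """Approximate CommitLimit for vm.overcommit_memory=2 (ignores hugepages)."""
    return swap_gib + ram_gib * overcommit_ratio / 100

# With the default overcommit_ratio of 50 and no swap, a 64 GiB box
# only admits 32 GiB of committed address space -- well below the
# ~42 GiB the PostgreSQL settings above were sized for.
print(commit_limit_gib(64, 0, 50))   # 32.0
```

That gap between the default commit limit and the configured PostgreSQL footprint is exactly the kind of surprise the overcommit-accounting document describes.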

I have `shared_buffers` at `16000MB`, `work_mem` at `80MB`, `temp_buffers`
at `8MB`, `max_connections` at `300` and `maintenance_work_mem` at `1GB`.
So all in all, I get to roughly 42GB of max memory usage
(`16000+(80+8)*300=42400`).

That work_mem is "per query operation", you can have multiple of
those in a single query:
https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-WORK-MEM
Also, there's hash_mem_multiplier:
https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-HASH-MEM-MULTIPLIER
Then I've seen that your query uses parallel workers, remember that
each worker requests memory.
Next, maintenance_work_mem is a per process limit, and depending on
what's running at any given time, that can add up.
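A back-of-the-envelope worst case along these lines might look as follows (a sketch; the operation count, hash_mem_multiplier and worker count are illustrative assumptions, not fixed PostgreSQL behaviour):

```python
def worst_case_mib(shared_buffers_mib, work_mem_mib, temp_buffers_mib,
                   connections, ops_per_query=3, hash_mem_multiplier=2.0,
                   parallel_workers=2):
    """Pessimistic upper bound: every connection runs a query with several
    sort/hash nodes, hash nodes get hash_mem_multiplier * work_mem, and
    each parallel worker has its own work_mem budget."""
    per_backend = temp_buffers_mib + (
        ops_per_query * work_mem_mib * hash_mem_multiplier * (1 + parallel_workers))
    return shared_buffers_mib + connections * per_backend

# The original estimate was 16000 + (80 + 8) * 300 = 42400 MiB.
# The pessimistic bound is an order of magnitude larger:
print(worst_case_mib(16000, 80, 8, 300))   # 450400.0
```

The point is not the exact number but that "work_mem * max_connections" is a floor, not a ceiling, on peak memory usage.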

Finally, there's this:

2024-09-12 05:18:36.073 UTC [1932776] LOG: background worker "parallel worker" (PID 3808076) exited with exit code 1
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
2024-09-12 05:18:36.083 UTC [1932776] LOG: background worker "parallel worker" (PID 3808077) was terminated by signal 6: Aborted

That "std::bad_alloc" sounds a lot like C++ and not like the C our
database is written in. My first suspicion would be that you're using
LLVM-JIT (unless you have other - maybe even your own - C++ extensions
in the database?) and that in itself can use a good chunk of memory.
And it looks like that exception bubbled up as a signal 6 (SIGABRT)
which made the process terminate immediately without any cleanup,
and after that the server has no other chance than to crash-restart.

I recommend starting with understanding the actual memory limits
as set by your configuration (personally I believe that memory
overcommit is less evil than some people think). Have a close look
at /proc/meminfo and if possible disable JIT and check if it changes
anything. Also if possible try starting with only a few active
connections and increase load carefully once a steady state (in
terms of memory usage) has been reached.

Regards,
Christoph

--
Spare Space

#3 Thomas Ziegler
thomas.ziegler@holmsecurity.com
In reply to: Christoph Moench-Tegeder (#2)
Re: Failing to allocate memory when I think it shouldn't

Hello Christoph,

Thanks for your answer and the suggestions, it already helped me out a lot!

On 2024-09-14 22:11, Christoph Moench-Tegeder wrote:

Hi,

## Thomas Ziegler (thomas.ziegler@holmsecurity.com):

There's a lot of information missing here. Let's start from the top.

I have had my database killed by the kernel oom-killer. After that I
turned off memory over-committing and that is where things got weird.

What exactly did you set? When playing with vm.overcommit, did you
understand "Committed Address Space" and the workings of the
overcommit accounting? This is the document:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/mm/overcommit-accounting.rst
Hint: when setting overcommit_memory=2 you might end up with way
less available address space than you thought you would. Also keep
an eye on /proc/meminfo - it's sometimes hard to estimate "just off
your cuff" what's in memory and how it's mapped. (Also, anything
else on that machine which might hog memory?).

I set overcommit_memory=2, but completely missed 'overcommit_ratio'.
That is most probably why the database got denied the RAM a lot sooner
than I expected.

Finally, there's this:

2024-09-12 05:18:36.073 UTC [1932776] LOG: background worker "parallel worker" (PID 3808076) exited with exit code 1
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
2024-09-12 05:18:36.083 UTC [1932776] LOG: background worker "parallel worker" (PID 3808077) was terminated by signal 6: Aborted

That "std::bad_alloc" sounds a lot like C++ and not like the C our
database is written in. My first suspicion would be that you're using
LLVM-JIT (unless you have other - maybe even your own - C++ extensions
in the database?) and that in itself can use a good chunk of memory.
And it looks like that exception bubbled up as a signal 6 (SIGABRT)
which made the process terminate immediately without any cleanup,
and after that the server has no other chance than to crash-restart.

Except for pgAudit, I don't have any extensions, so it is probably the
JIT. I had no idea there was a JIT, even though it should have been
obvious. Thanks for pointing this out!

Is the memory the JIT takes limited by 'work_mem' or will it just take
as much memory as it needs?

I recommend starting with understanding the actual memory limits
as set by your configuration (personally I believe that memory
overcommit is less evil than some people think). Have a close look
at /proc/meminfo and if possible disable JIT and check if it changes
anything. Also if possible try starting with only a few active
connections and increase load carefully once a steady state (in
terms of memory usage) has been reached.

Yes, understanding the memory limits is what I was trying to do.
I was questioning my understanding, but it seems it was Linux
(or rather my lack of understanding of it) that tripped me up,
not the database. Memory management and /proc/meminfo still manage
to confuse me.

Again, thanks for your help!

Cheers,
Thomas

p.s.: To anybody who stumbles upon this in the future,
if you set `overcommit_memory=2`, don't forget `overcommit_ratio`.
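For future readers, the relevant sysctl settings might look like this (values are examples only, not recommendations; tune them for your own machine):

```conf
# vm.overcommit_memory=2 disables heuristic overcommit; allocations are
# then capped at roughly CommitLimit = swap + RAM * overcommit_ratio / 100.
vm.overcommit_memory = 2
# Example ratio for a dedicated 64 GiB database host with no swap:
# 80% of RAM gives a ~51 GiB commit limit.
vm.overcommit_ratio = 80
```

Persist such settings under /etc/sysctl.d/ and apply them with `sysctl --system`.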

#4 Christoph Moench-Tegeder
cmt@burggraben.net
In reply to: Thomas Ziegler (#3)
Re: Failing to allocate memory when I think it shouldn't

Hi,

## Thomas Ziegler (thomas.ziegler@holmsecurity.com):

Except for pgAudit, I don't have any extensions, so it is probably the
JIT. I had no idea there was a JIT, even though it should have been
obvious. Thanks for pointing this out!

There is - it even has its own chapter in the documentation:
https://www.postgresql.org/docs/current/jit.html
Most importantly, you can disable JIT per session ("SET jit=off")
or globally in the configuration file (jit=off, reload is
sufficient) or with any of the other usual configuration mechanisms.
If that fixes your problem, congratulations (and the problem is
somewhere down between bytecode generation and what and how llvm
(in its particular version) generates from that).

Is the memory the JIT takes limited by 'work_mem' or will it just take
as much memory as it needs?

The latter.

Regards,
Christoph

--
Spare Space

#5 Siraj G
tosiraj.g@gmail.com
In reply to: Christoph Moench-Tegeder (#4)
Re: Failing to allocate memory when I think it shouldn't

Hello Christoph

I am getting the same error in postgres 12 (sorry that our version upgrade
sucks). I see that hash_mem_multiplier is available from version 13. What
could we do in version 12?

The error is:
Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used: pg_ts_dict_oid_index
ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 110141936 bytes in 2085 blocks; 2842528 free (149 chunks); 107299408 used
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used:
pg_event_trigger_oid_index
index info: 3072 total in 2 blocks; 1216 free (3 chunks); 1856 used:
pg_conversion_default_index
index info: 3072 total in 2 blocks; 1136 free (2 chunks); 1936 used:
pg_operator_oprname_l_r_n_index
index info: 2048 total in 2 blocks; 680 free (2 chunks); 1368 used:
pg_trigger_tgrelid_tgname_index
index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used:
pg_enum_typid_label_index
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used:
pg_ts_config_oid_index
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used:
pg_user_mapping_oid_index
index info: 2048 total in 2 blocks; 704 free (3 chunks); 1344 used:
pg_opfamily_am_name_nsp_index
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used:
pg_foreign_table_relid_index
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used:
pg_type_oid_index
index info: 1024 total in 1 blocks; 48 free (0 chunks); 976 used:
pg_aggregate_fnoid_index
47 more child contexts containing 80896 total in 78 blocks; 25784 free
(51 chunks); 55112 used
PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
MdSmgr: 8192 total in 1 blocks; 5528 free (0 chunks); 2664 used
LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 74714976 bytes in 1007 blocks; 2893968 free (151 chunks);
71821008 used
10.3.2.133,2025-07-30 19:03:04
IST,431246,orchids_letseduvate_db,autoscaling,1,ERROR: out of memory
10.3.2.133,2025-07-30 19:03:04
IST,431246,orchids_letseduvate_db,autoscaling,2,DETAIL: Failed on request
of size 32800 in memory context "HashBatchContext".

We have these memory settings:
work_mem=2GB
maintenance_work_mem=2GB
shared_buffers=48GB
max_parallel_workers=8

This issue is happening in the REPLICA instance.

Regards
Siraj

On Wed, Sep 18, 2024 at 12:35 AM Christoph Moench-Tegeder <
cmt@burggraben.net> wrote:


#6 Christoph Moench-Tegeder
cmt@burggraben.net
In reply to: Siraj G (#5)
Re: Failing to allocate memory when I think it shouldn't

## Siraj G (tosiraj.g@gmail.com):

I am getting the same error in postgres 12 (sorry that our version upgrade
sucks).

In all likelihood, this is a somewhat different situation, as nothing
here points to JIT.

I see that hash_mem_multiplier is available from version 13. What
could we do in version 12?

Obviously, you could upgrade - you already identified the change which
could at least improve matters here (when you do that, don't just stop
at version 13: that is going EOL in a few months, too; and the whole
hash memory allocation got some more polishing in 15).

Also, as you got a standard out-of-memory error and not the dreaded
OOM-kill, I assume that you set vm.overcommit_memory to a non-default
(that is, != 0) value, but you did not mention anything about setting
overcommit_ratio or overcommit_kbytes - see upthread for a link to the
documentation. Make sure that you have reasonable settings here.

Thirdly, you should check whether your memory settings are actually
suitable for your workload and machine, considering concurrent
operations and their memory usage against actually available memory.

Then check if that specific hash join grows beyond reasonable size
(can you even execute the statement with the given parameters on
an idle system?), and if it does investigate why the hash map is
so much larger than estimated. You might have a statistics problem.
You could try to reduce work_mem to steer the planner away from
that hash join to a less memory-intensive strategy, or you could
temporarily disable hash joins altogether and analyse the impact
on your application.

Regards,
Christoph

--
Spare Space