Help in dealing with OOM

Started by Siraj Gover 1 year ago3 messagesgeneral
Jump to latest
#1Siraj G
tosiraj.g@gmail.com

Hello Experts!

My secondary instance has been unstable today. The service is crashing with
Out of Memory. Please see below error
(/var/log/postgresql/postgresql-2024-10-14.log):

10.2.52.50,2024-10-14 15:28:11 IST,686671,finance_revamp,finance_b2b,1,LOG:
duration: 1148.527 ms statement: SELECT "tripschedule"."id",
"tripschedule"."name", "tripschedule"."branch_id",
"tripschedule"."route_id", "tripschedule"."route_name",
"tripschedule"."bus_id", "tripschedule"."path",
"tripschedule"."path_timings", "tripschedule"."recurring_days",
"tripschedule"."start_time", "tripschedule"."end_time",
"tripschedule"."start_date", "tripschedule"."end_date",
"tripschedule"."created_at", "tripschedule"."created_by",
"tripschedule"."shift", "tripschedule"."is_deleted",
"tripschedule"."is_active", "tripschedule"."is_cancelled",
"tripschedule"."branch_latitude", "tripschedule"."branch_longitude",
"tripschedule"."polygon_id", "tripschedule"."is_after_noon_shift" FROM
"tripschedule" INNER JOIN "bus" ON ("tripschedule"."bus_id" = "bus"."id")
WHERE ("bus"."vehicle_no" = 'KA51AH1922' AND "tripschedule"."end_date" >=
'2024-10-14'::date AND "tripschedule"."is_active" AND
"tripschedule"."recurring_days" && ARRAY[1]::integer[] AND
"tripschedule"."start_date" <= '2024-10-14'::date AND
("tripschedule"."start_time" BETWEEN '14:57:57.654167'::time AND
'15:57:57.654167'::time OR "tripschedule"."end_time" BETWEEN
'14:57:57.654167'::time AND '15:57:57.654167'::time OR
("tripschedule"."start_time" <= '15:27:57.654167'::time AND
"tripschedule"."end_time" >= '15:27:57.654167'::time))) ORDER BY
"tripschedule"."id" DESC LIMIT 1
10.2.52.22,2024-10-14 15:28:11
IST,686748,orchids_letseduvate_db,autoscaling,1,LOG: duration: 468.028 ms
statement: SELECT (1) AS "a" FROM "test" INNER JOIN "test_section_mapping"
ON ("test"."id" = "test_section_mapping"."test_id") INNER JOIN
"test_subjects" ON ("test"."id" = "test_subjects"."test_id") INNER JOIN
"user_response" ON ("test"."id" = "user_response"."test_id") WHERE
(("test"."test_date")::date >= '2024-10-14'::date AND
("test"."test_date")::date <= '2024-10-17'::date AND NOT "test"."is_delete"
AND "test_section_mapping"."sectionmapping_id" IN (136364) AND
"test_subjects"."subject_id" = 16 AND NOT "user_response"."is_delete" AND
"user_response"."submitted_by_id" = 61725) LIMIT 1
,2024-10-14 15:28:11 IST,2334064,,,8,LOG: checkpointer process (PID
2334587) was terminated by signal 9: Killed
,2024-10-14 15:28:11 IST,2334064,,,9,LOG: terminating any other active
server processes
10.2.52.50,2024-10-14 15:28:11
IST,686752,mcollege_letseduvate_db,finance_b2b,1,WARNING: terminating
connection because

This is from the OS log (/var/log/kern.log):

oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.sli
ce/system-postgresql.slice/postgresql@12-main.service
,task=postgres,pid=2334587,uid=114
494 Oct 14 09:58:10 gce-k12-prod-as1-erp-pg-secondary kernel:
[6905020.514569] Out of memory: Killed process 2334587 (postgres)
total-vm:26349584kB, anon-rss:3464kB, file-rss:0kB, shmem-rs
s:21813032kB, UID:114 pgtables:49024kB oom_score_adj:0

Regards
Siraj

#2Joe Conway
mail@joeconway.com
In reply to: Siraj G (#1)
Re: Help in dealing with OOM

On 10/14/24 14:37, Siraj G wrote:

This is from the OS log (/var/log/kern.log):

oom-
kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.sli      ce/system-postgresql.slice/postgresql@12-main.service,task=postgres,pid=2334587,uid=114
  494 Oct 14 09:58:10 gce-k12-prod-as1-erp-pg-secondary kernel:
[6905020.514569] Out of memory: Killed process 2334587 (postgres) total-
vm:26349584kB, anon-rss:3464kB, file-rss:0kB, shmem-rs
 s:21813032kB, UID:114 pgtables:49024kB oom_score_adj:0

1. Do you happen to have swap disabled? If so, don't do that.

2. Does the postgres cgroup have memory.limit (cgroup v1) or memory.max
(cgroup v2) set?

3. If #2 answer is no, have you followed the documented guidance here
(in particular vm.overcommit_memory=2):

https://www.postgresql.org/docs/12/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#3Siraj G
tosiraj.g@gmail.com
In reply to: Joe Conway (#2)
Re: Help in dealing with OOM

Thanks Joe, I will set these kernel parameters.

I also would like to highlight that the issue happened on SECONDARY. While
the PRIMARY has less memory and computation in comparison to SECONDARY, not
sure if there is anything wrong in the PgSQL.

PRIMARY: 48vCPUs & 48GB memory
SECONDARY: 64vCPUs & 64GB memory

I noticed a few things which do not sound tidy:
1. Total number of DBs are: 1860 (DB environment serves a product that has
tenants - around 1100 tenants which means these many DBs are active)
: Is there any metric for optimal performance on the number of DBs we
should have per instance? I would assume NO (and it should be purely based
on the overall operations), but just a question out of curiosity.
2. max_connections is set to 10000.
I tried to reduce it to 4000 but was unable to do so (I tried this after
reducing the max_connections in PRIMARY to 4000). This is the error:
FATAL: hot standby is not possible because max_connections = 4000 is a
lower setting than on the master server (its value was 10000)

If I am clubbing multiple things, sorry for the clutter.

Regards
Siraj

On Tue, Oct 15, 2024 at 12:39 AM Joe Conway <mail@joeconway.com> wrote:

Show quoted text

On 10/14/24 14:37, Siraj G wrote:

This is from the OS log (/var/log/kern.log):

oom-

kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.sli
ce/system-postgresql.slice/postgresql@12-main.service
,task=postgres,pid=2334587,uid=114

494 Oct 14 09:58:10 gce-k12-prod-as1-erp-pg-secondary kernel:
[6905020.514569] Out of memory: Killed process 2334587 (postgres) total-
vm:26349584kB, anon-rss:3464kB, file-rss:0kB, shmem-rs
s:21813032kB, UID:114 pgtables:49024kB oom_score_adj:0

1. Do you happen to have swap disabled? If so, don't do that.

2. Does the postgres cgroup have memory.limit (cgroup v1) or memory.max
(cgroup v2) set?

3. If #2 answer is no, have you followed the documented guidance here
(in particular vm.overcommit_memory=2):

https://www.postgresql.org/docs/12/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com