Heavy load on DB Cluster

Started by KK CHN4 months ago2 messagesgeneral

kkchn.in@gmail.com

4 months ago

List,

I am experiencing heavy load on my database cluster and DB server
performance degrading over the time.
vCPUs 16 , Mem 32 G Swap : 8G storage 5T RHEL 9.4 postgres 16

top - 11:55:18 up 175 days, 7:52, 3 users, load average: 11.07, 10.05,
9.56
Tasks: 731 total, 14 running, 717 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.8 us, 9.3 sy, 0.0 ni, 44.9 id, 13.7 wa, 0.8 hi, 2.5 si,
0.0 st
MiB Mem : 31837.6 total, 531.8 free, 14773.3 used, 25392.0 buff/cache
MiB Swap: 8060.0 total, 5140.4 free, 2919.6 used. 17064.2 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
3148533 postgre+ 20 0 8973544 7.3g 7.3g S 32.9 23.5 0:22.52
postgres
3150012 postgre+ 20 0 8991380 7.4g 7.4g S 32.9 23.9 0:26.16
postgres
3081907 postgre+ 20 0 9078400 8.0g 7.9g R 21.9 25.8 2:44.53
postgres
3125409 postgre+ 20 0 9075568 8.1g 8.0g S 21.3 26.0 2:34.63
postgres
3126500 postgre+ 20 0 9073928 8.0g 7.9g S 18.3 25.7 2:33.10
postgres
3081925 postgre+ 20 0 9059088 8.2g 8.2g S 17.6 26.5 6:38.79
postgres

I have pgbackrest(2.52.1) running for incremental backups to a remote
reposerver and local WAL replication configured to an onprem standalone
instance on another VM in the same local LAN.

archive_mode = on # enables archiving; off, on, or always
# (change requires restart)
# (empty string indicates archive_command
should
# be used)
archive_command = 'pgbackrest --stanza=My_Repo archive-push %p'

*To identify the resource consuming queries I ran and found only one [40
days 17:22:59.029204 | START_REPLICATION 8E ] *and rest all seems normal
.

How come this START_REPLICATION running for 40 days and 17:30 Hrs Is this
normal ?? * Is this due to pgbackrest or WAL replication *to local
VM ?

What may be the issue and how to resolve it ?

Any hints much appreciated.. Please see the below pasted outputs for more
information.

Thank you,
Krishane

Any more tests I need to perform let me know, I can produce those
information also.

postgres=# SELECT pid, now() - query_start AS duration, query, state FROM
pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;

pid | duration |

query

| state
---------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------
------------------+---------------------
2653841 | 40 days 17:22:59.029204 | START_REPLICATION 8EF/40000000
TIMELINE 1

| active
3119645 | 00:00:00.454917 | select
easi1_0.signal_id,easi1_0.action_list,easi1_0.additional_info,easi1_0.address,easi1_0.alloc_voip_extn,easi1_0.app_version,easi1_0.caller_name,easi1_0.caller_no,easi1
_0.close_remarks,easi1_0.count,easi1_0.device_info,easi1_0.district_code,easi1_0.emergency_signal,easi1_0.event_type,easi1_0.gender,easi1_0.gps_accuracy,easi1_0.imei_no,easi1_0.informed_officers,easi1_0.invoke
_id,easi1_0.is_shout,easi1_0.last_update_time,easi1_0.latitude,easi1_0.longitude,easi1_0.place,easi1_0.ps_code,easi1_0.receive_time,easi1_0.es_signal_id,easi1_0.rescuer_count,easi1_0.service,easi1_0.signal_s
tatus,easi1_0.signal_type,easi1_0.silent_communication,easi1_0.source_type,easi1_0.state_code,easi1_0.is_valid_gps
from es_app.es_app_signal_info easi1_0 where easi1_0.caller_no=$1 and
easi1_0.imei_no=$2 a
nd easi1_0.source_type=$3 order by easi1_0.last_update_time desc fetch
first $4 rows only
| active
.............................
.............................................

Laurenz Albe

laurenz.albe@cybertec.at

4 months ago

In reply to: KK CHN (#1)

Re: Heavy load on DB Cluster

On Thu, 2026-03-05 at 12:30 +0530, KK CHN wrote:

I am experiencing heavy load on my database cluster and DB server performance degrading over the time.
vCPUs 16 , Mem 32 G Swap : 8G storage 5T RHEL 9.4 postgres 16

top - 11:55:18 up 175 days, 7:52, 3 users, load average: 11.07, 10.05, 9.56
Tasks: 731 total, 14 running, 717 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.8 us, 9.3 sy, 0.0 ni, 44.9 id, 13.7 wa, 0.8 hi, 2.5 si, 0.0 st
MiB Mem : 31837.6 total, 531.8 free, 14773.3 used, 25392.0 buff/cache
MiB Swap: 8060.0 total, 5140.4 free, 2919.6 used. 17064.2 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3148533 postgre+ 20 0 8973544 7.3g 7.3g S 32.9 23.5 0:22.52 postgres
3150012 postgre+ 20 0 8991380 7.4g 7.4g S 32.9 23.9 0:26.16 postgres
3081907 postgre+ 20 0 9078400 8.0g 7.9g R 21.9 25.8 2:44.53 postgres
3125409 postgre+ 20 0 9075568 8.1g 8.0g S 21.3 26.0 2:34.63 postgres
3126500 postgre+ 20 0 9073928 8.0g 7.9g S 18.3 25.7 2:33.10 postgres
3081925 postgre+ 20 0 9059088 8.2g 8.2g S 17.6 26.5 6:38.79 postgres

I have pgbackrest(2.52.1) running for incremental backups to a remote reposerver
and local WAL replication configured to an onprem standalone instance on
another VM in the same local LAN.

archive_mode = on # enables archiving; off, on, or always
# (change requires restart)
# (empty string indicates archive_command should
# be used)
archive_command = 'pgbackrest --stanza=My_Repo archive-push %p'

To identify the resource consuming queries I ran and found only one
[40 days 17:22:59.029204 | START_REPLICATION 8E ] and rest all seems normal .

How come this START_REPLICATION running for 40 days and 17:30 Hrs Is this normal ??
Is this due to pgbackrest or WAL replication to local VM ?

That is an active replication - or pg_receivewal, which amounts to the same.
That's not really a query; the standby is streaming WAL from the primary and
has been doing that for over 40 days. Nothing to worry about.

What may be the issue and how to resolve it ?

For that, configure pg_stat_statements and use it to find your most time-consuming
statements.

Yours,
Laurenz Albe