Heavy load on DB Cluster

Started by KK CHNabout 1 month ago2 messagesgeneral
Jump to latest
#1KK CHN
kkchn.in@gmail.com

List,

I am experiencing heavy load on my database cluster and DB server
performance degrading over the time.
vCPUs 16 , Mem 32 G Swap : 8G storage 5T RHEL 9.4 postgres 16

top - 11:55:18 up 175 days, 7:52, 3 users, load average: 11.07, 10.05,
9.56
Tasks: 731 total, 14 running, 717 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.8 us, 9.3 sy, 0.0 ni, 44.9 id, 13.7 wa, 0.8 hi, 2.5 si,
0.0 st
MiB Mem : 31837.6 total, 531.8 free, 14773.3 used, 25392.0 buff/cache
MiB Swap: 8060.0 total, 5140.4 free, 2919.6 used. 17064.2 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
3148533 postgre+ 20 0 8973544 7.3g 7.3g S 32.9 23.5 0:22.52
postgres
3150012 postgre+ 20 0 8991380 7.4g 7.4g S 32.9 23.9 0:26.16
postgres
3081907 postgre+ 20 0 9078400 8.0g 7.9g R 21.9 25.8 2:44.53
postgres
3125409 postgre+ 20 0 9075568 8.1g 8.0g S 21.3 26.0 2:34.63
postgres
3126500 postgre+ 20 0 9073928 8.0g 7.9g S 18.3 25.7 2:33.10
postgres
3081925 postgre+ 20 0 9059088 8.2g 8.2g S 17.6 26.5 6:38.79
postgres

I have pgbackrest(2.52.1) running for incremental backups to a remote
reposerver and local WAL replication configured to an onprem standalone
instance on another VM in the same local LAN.

archive_mode = on # enables archiving; off, on, or always
# (change requires restart)
# (empty string indicates archive_command
should
# be used)
archive_command = 'pgbackrest --stanza=My_Repo archive-push %p'

*To identify the resource consuming queries I ran and found only one [40
days 17:22:59.029204 | START_REPLICATION 8E ] *and rest all seems normal
.

How come this START_REPLICATION running for 40 days and 17:30 Hrs Is this
normal ?? * Is this due to pgbackrest or WAL replication *to local
VM ?

What may be the issue and how to resolve it ?

Any hints much appreciated.. Please see the below pasted outputs for more
information.

Thank you,
Krishane

Any more tests I need to perform let me know, I can produce those
information also.

postgres=# SELECT pid, now() - query_start AS duration, query, state FROM
pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;

pid | duration |

query

| state
---------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------
------------------+---------------------
2653841 | 40 days 17:22:59.029204 | START_REPLICATION 8EF/40000000
TIMELINE 1

| active
3119645 | 00:00:00.454917 | select
easi1_0.signal_id,easi1_0.action_list,easi1_0.additional_info,easi1_0.address,easi1_0.alloc_voip_extn,easi1_0.app_version,easi1_0.caller_name,easi1_0.caller_no,easi1
_0.close_remarks,easi1_0.count,easi1_0.device_info,easi1_0.district_code,easi1_0.emergency_signal,easi1_0.event_type,easi1_0.gender,easi1_0.gps_accuracy,easi1_0.imei_no,easi1_0.informed_officers,easi1_0.invoke
_id,easi1_0.is_shout,easi1_0.last_update_time,easi1_0.latitude,easi1_0.longitude,easi1_0.place,easi1_0.ps_code,easi1_0.receive_time,easi1_0.es_signal_id,easi1_0.rescuer_count,easi1_0.service,easi1_0.signal_s
tatus,easi1_0.signal_type,easi1_0.silent_communication,easi1_0.source_type,easi1_0.state_code,easi1_0.is_valid_gps
from es_app.es_app_signal_info easi1_0 where easi1_0.caller_no=$1 and
easi1_0.imei_no=$2 a
nd easi1_0.source_type=$3 order by easi1_0.last_update_time desc fetch
first $4 rows only
| active
.............................
.............................................

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: KK CHN (#1)
Re: Heavy load on DB Cluster

On Thu, 2026-03-05 at 12:30 +0530, KK CHN wrote:

I am experiencing heavy load on my database cluster and DB server performance degrading over the time.
vCPUs 16 ,  Mem 32 G   Swap : 8G   storage 5T       RHEL 9.4  postgres 16

top - 11:55:18 up 175 days,  7:52,  3 users,  load average: 11.07, 10.05, 9.56
Tasks: 731 total,  14 running, 717 sleeping,   0 stopped,   0 zombie
%Cpu(s): 28.8 us,  9.3 sy,  0.0 ni, 44.9 id, 13.7 wa,  0.8 hi,  2.5 si,  0.0 st
MiB Mem :  31837.6 total,    531.8 free,  14773.3 used,  25392.0 buff/cache
MiB Swap:   8060.0 total,   5140.4 free,   2919.6 used.  17064.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3148533 postgre+  20   0 8973544   7.3g   7.3g S  32.9  23.5   0:22.52 postgres
3150012 postgre+  20   0 8991380   7.4g   7.4g S  32.9  23.9   0:26.16 postgres
3081907 postgre+  20   0 9078400   8.0g   7.9g R  21.9  25.8   2:44.53 postgres
3125409 postgre+  20   0 9075568   8.1g   8.0g S  21.3  26.0   2:34.63 postgres
3126500 postgre+  20   0 9073928   8.0g   7.9g S  18.3  25.7   2:33.10 postgres
3081925 postgre+  20   0 9059088   8.2g   8.2g S  17.6  26.5   6:38.79 postgres 

I have pgbackrest(2.52.1) running for incremental backups to a remote reposerver
and local   WAL replication configured to an onprem standalone  instance on
another VM in the same local LAN. 

archive_mode = on               # enables archiving; off, on, or always
                                # (change requires restart)
                                # (empty string indicates archive_command should
                                # be used)
archive_command = 'pgbackrest --stanza=My_Repo archive-push %p'

To identify the resource consuming queries I ran  and found  only one
[40 days 17:22:59.029204 | START_REPLICATION 8E ]  and rest all seems normal .   

How come this START_REPLICATION running for 40 days and 17:30 Hrs Is this normal ??
Is this due to pgbackrest    or   WAL replication to local VM ?

That is an active replication - or pg_receivewal, which amounts to the same.
That's not really a query; the standby is streaming WAL from the primary and
has been doing that for over 40 days. Nothing to worry about.

What may be the issue and how to resolve it ?

For that, configure pg_stat_statements and use it to find your most time-consuming
statements.

Yours,
Laurenz Albe