Re: postgres cpu 100% need help
Dear sir:
Recently a weird issue with my PostgreSQL database has been bothering me a lot, so I really need your help. Here is the problem: most of the time the database works very well, averaging about 3500 tps, the CPU usage of its processes is 3%~10%, and every query responds in less than 20 ms. But sometimes the CPU usage of a process suddenly climbs to 90%+, and at that point a simple query can take 2000+ ms. PS: my PostgreSQL version is 9.3.5 and the database is an OLTP server.
Here is the strace output from one of the processes while its CPU usage was at 100%:
recvfrom(10, "P\0\0\0J\0 \t\tselect o_count from tsh"..., 8192, 0, NULL, NULL) = 120
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
lseek(16, 0, SEEK_END) = 180543488
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
lseek(16, 0, SEEK_END) = 180543488
... many more lseek(16, 0, SEEK_END) = 180543488, with an occasional select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1o_count\0\0\0\0\0\0\0\0"..., 75, 0, NULL, 0) = 75
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
recvfrom(10, "P\0\0\0\30\0select version()\0\0\0B\0\0\0\f\0\0"..., 8192, 0, NULL, NULL) = 60
sendto(10, "1\0\0\0\0042\0\0\0\4T\0\0\0 \0\1version\0\0\0\0\0\0\0\0"..., 183, 0, NULL, 0) = 183
ps:my postgres config:
autovacuum_analyze_scale_factor | 0.1
autovacuum_analyze_threshold | 50
autovacuum_freeze_max_age | 200000000
autovacuum_max_workers | 3
autovacuum_multixact_freeze_max_age | 400000000
autovacuum_naptime | 1min
autovacuum_vacuum_cost_delay | 20ms
autovacuum_vacuum_cost_limit | -1
autovacuum_vacuum_scale_factor | 0.2
autovacuum_vacuum_threshold | 50
checkpoint_segments | 32
checkpoint_timeout | 5min
effective_cache_size | 64GB
maintenance_work_mem | 500MB
max_connections | 2000
max_files_per_process | 1000
work_mem | 20MB
shared_buffers | 25GB
Could these problems be caused by the configuration?
qijia.wang@melot.cn
From: 657985552@qq.com
Sent: 2015-10-26 16:14
To: 王祺佳
Subject: postgres cpu 100% need help
Hello everyone:
My PostgreSQL version is 9.3.5. My database is an OLTP server; normally CPU usage is 3%~10%, every SQL statement returns in under 20 ms, and there are fewer than 5 concurrently active statements.
But sometimes the database CPU climbs to 90%+ and a simple SELECT takes 2000+ ms.
select count(*) from pg_stat_activity where state<>'idle';
count
-------
126
I ran strace -p 86181 (one of the processes while CPU was at 100%):
recvfrom(10, "Q\0\0\0?select * from tshow.p_hpa"..., 8192, 0, NULL, NULL) = 64
lseek(8, 0, SEEK_END) = 32768
lseek(7, 0, SEEK_END) = 1155072
lseek(7, 0, SEEK_END) = 1155072
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
... many more select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
zhiwei.li
On Tue, 27 Oct 2015 11:30:45 +0800
"657985552@qq.com" <657985552@qq.com> wrote:
9.3.5 is pretty old, you should probably schedule an upgrade.
shared_buffers | 25GB
Try setting this to 16GB. It's been a while since I tested on
large-memory/high-load systems, but I seem to remember that
shared_buffers above 16G could cause these sorts of intermittent
stalls.
If that doesn't improve the situation, you'll probably need to
provide more details, specifically the layout of the table in
question, as well as the queries that are active when the
problem occurs, and the contents of the pg_locks table when
the problem is occurring.
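For reference, a query along these lines can capture that lock state while the stall is happening; it uses only standard pg_locks / pg_stat_activity columns (both have a pid column in 9.3), and the join on pid is the usual idiom:

```sql
-- Ungranted locks during the stall, with the query of each waiting backend.
SELECT l.locktype, l.relation::regclass AS relation, l.mode, l.granted,
       a.pid, a.state, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;
```

An empty result would point away from heavyweight lock waits.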
--
Bill Moran
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tue, Oct 27, 2015 at 12:14 PM, Bill Moran <wmoran@potentialtech.com> wrote:
possible culprits:
*) io based problems (check iowait, rule this out first)
*) THP compaction (rule this out second)
*) runaway query plan
*) concurrency problems within postgres itself (perf top capture
during load is essential)
maybe some other things I'm not thinking of.
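To rule out THP compaction, the usual first step is to check whether THP is enabled and, if so, disable it and see whether the stalls disappear. A sketch (paths are the mainline-kernel locations; RHEL/CentOS 6 use /sys/kernel/mm/redhat_transparent_hugepage instead):

```
# show current settings; the value in [brackets] is the active one
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
# as root, disable THP and its defrag/compaction at runtime
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```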
merlin
On Mon, Oct 26, 2015 at 8:30 PM, 657985552@qq.com <657985552@qq.com> wrote:
From all the lseeks on the same file, it looks like you are doing a
sequential scan on a table for the inside of a nested loop. What is the
query it is running, and what is the execution plan for it? To get that,
run:
explain (analyze, buffers) select o_count from ....
Cheers,
Jeff
On Tue, Oct 27, 2015 at 2:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
this smells like thp compaction:
On Tue, Oct 27, 2015 at 4:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
drat -- sorry for partial email.
check this link:
http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/
if that turns out not to be productive, need to check 'perf top', and
also rule out problems with individual queries.
merlin
Thank you for your reply. I will try changing it and wait for the problem to appear again. The SQL is:
select count(t.*) into o_count from tshow.res_room_weight t,tshow.res_room_info r
where t.subcatlg_id=:i_title_id
and t.roomid = r.actorid
and r.levels>=0;
tshow=> \d res_room_info;
Table "tshow.res_room_info"
Column | Type | Modifiers
----------------+-----------------------------+---------------
actorid | integer | not null
nickname | text | not null
livetype | integer |
people_inroom | integer |
poster | character varying(128) |
actor_level | integer |
operatorid | integer |
jointime | timestamp without time zone |
signtime | timestamp without time zone |
levels | integer |
note | text |
leavereason | text |
register_city | integer |
vedio_level | integer |
is_good | integer | default 0
is_display | integer |
live_starttime | timestamp without time zone |
live_endtime | timestamp without time zone |
next_starttime | timestamp without time zone |
max_count | integer | default 40000
is_recommend | integer |
icon | integer |
rich_level | integer |
type | integer |
room_mode | integer |
room_theme | text |
portrait | text |
gender | integer | default 0
tag | text |
live_poster | text |
family_id | integer |
room_lock | integer | default 0
Indexes:
"res_room_info_pkey" PRIMARY KEY, btree (actorid)
"idx_res_room_info_cityid" btree (register_city)
tshow=> \d tshow.res_room_weight
Table "tshow.res_room_weight"
Column | Type | Modifiers
--------------+-----------------------------+-----------
subcatlg_id | integer | not null
roomid | integer | not null
weight | integer | default 0
is_recommend | integer |
update_time | timestamp without time zone |
product_id | integer | default 1
create_time | timestamp without time zone |
Indexes:
"res_room_weight_pkey" PRIMARY KEY, btree (subcatlg_id, roomid) CLUSTER
tshow=> select count(*) from tshow.res_room_info ;
count
-------
22648
(1 row)
tshow=> select count(*) from tshow.res_room_weight ;
count
-------
23417
I don't see any exclusive locks in the pg_locks view.
657985552@qq.com
From: Bill Moran
Date: 2015-10-28 01:14
To: 657985552@qq.com
CC: pgsql-general
Subject: Re: [GENERAL] Re: postgres cpu 100% need help
Thank you for your reply.
1. I'm sure IO is not the problem, but I didn't record the IO status at the time.
2. What is THP compaction? How can I rule it out?
3. The explain output:
tshow=> explain select count(t.*) from tshow.res_room_weight t,tshow.res_room_info r
tshow-> where t.subcatlg_id=46
tshow-> and t.roomid = r.actorid
tshow-> and r.levels>=0;
QUERY PLAN
------------------------------------------------------------------------------------------
Aggregate (cost=24581.50..24581.51 rows=1 width=60)
-> Hash Join (cost=1737.27..24572.14 rows=3744 width=60)
Hash Cond: (r.actorid = t.roomid)
-> Seq Scan on res_room_info r (cost=0.00..22322.19 rows=22364 width=4)
Filter: (levels >= 0)
-> Hash (cost=1689.86..1689.86 rows=3793 width=64)
-> Seq Scan on res_room_weight t (cost=0.00..1689.86 rows=3793 width=64)
Filter: (subcatlg_id = 46)
4. Next I will use perf top to capture data during the load.
657985552@qq.com
From: Merlin Moncure
Date: 2015-10-28 02:35
To: Bill Moran
CC: 657985552@qq.com; pgsql-general
Subject: Re: [GENERAL] Re: postgres cpu 100% need help
Thank you for your reply.
tshow=> explain (analyze, buffers) select count(t.*) from tshow.res_room_weight t,tshow.res_room_info r
tshow-> where t.subcatlg_id=46
tshow-> and t.roomid = r.actorid
tshow-> and r.levels>=0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=24581.57..24581.58 rows=1 width=60) (actual time=112.613..112.613 rows=1 loops=1)
Buffers: shared hit=23439
-> Hash Join (cost=1737.31..24572.21 rows=3744 width=60) (actual time=16.375..111.879 rows=3730 loops=1)
Hash Cond: (r.actorid = t.roomid)
Buffers: shared hit=23439
-> Seq Scan on res_room_info r (cost=0.00..22322.20 rows=22365 width=4) (actual time=0.031..86.026 rows=22365 loops=1)
Filter: (levels >= 0)
Rows Removed by Filter: 291
Buffers: shared hit=22039
-> Hash (cost=1689.90..1689.90 rows=3793 width=64) (actual time=16.274..16.274 rows=3793 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 329kB
Buffers: shared hit=1397
-> Seq Scan on res_room_weight t (cost=0.00..1689.90 rows=3793 width=64) (actual time=0.348..14.663 rows=3793 loops=1)
Filter: (subcatlg_id = 46)
Rows Removed by Filter: 19641
Buffers: shared hit=1397
Total runtime: 112.754 ms
It took only 112.754 ms. The log from the high-load period is:
2015-10-28 00:00:17.177 CST "SELECT",2015-10-23 19:30:20 CST,276/59546142,0,LOG,00000,"duration: 3300.237 ms execute <unnamed>: SELECT o_count FROM tshow.p_hpart_getcount_intitle($1)","parameters: $1 = '46'",,,,,,,,""
657985552@qq.com
From: Jeff Janes
Sent: 2015-10-28 03:33
To: 657985552@qq.com
Cc: pgsql-general
Subject: Re: [GENERAL] Re: postgres cpu 100% need help
On 10/28/2015 12:12 AM, 657985552@qq.com wrote:
http://explain.depesz.com/s/oKN5
shows where the time is spent in that query, but as you say, for that one
the timing was fine. There's a lot of time being spent in a seq scan
of res_room_info.
I would have written that query as the inner join it is, but that
probably won't affect how it's executed.
select ... from tshow.res_room_weight t join tshow.res_room_info r on
(t.roomid = r.actorid) where t.subcatlg_id=46 and r.levels>=0;
--
john r pierce, recycling bits in santa cruz
Thanks, I can change the SQL; it needs a small business-logic change in my application. But I don't understand why, when I execute the SQL myself during the high-load period, it takes 200-300 ms, yet the log records it taking 2000 ms.
Log message:
duration: 2042.493 ms execute <unnamed>: select o_count from tshow.p_hpart_liveingroomlist_count($1)","parameters: $1 = '46'
657985552@qq.com
From: John R Pierce
Sent: 2015-10-28 16:28
To: pgsql-general
Subject: Re: [GENERAL] Re: Re: [GENERAL] Re: postgres cpu 100% need help
On Wed, Oct 28, 2015 at 12:12 AM, 657985552@qq.com <657985552@qq.com> wrote:
You really need it during the high-load condition. Perhaps
auto_explain could help.
Cheers,
Jeff
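A minimal auto_explain setup along those lines might be (parameter values are illustrative; shared_preload_libraries needs a server restart, and log_nested_statements matters here because the slow query runs inside a plpgsql function):

```
# postgresql.conf
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '1s'      # only log statements slower than 1s
auto_explain.log_analyze = on             # include actual times and row counts
auto_explain.log_buffers = on             # include buffer hit/read counts
auto_explain.log_nested_statements = on   # catch queries inside functions
```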
Hi Moran and others:
Yesterday I hit the problem again, and I used perf top. The observations follow:
PerfTop: 11574 irqs/sec kernel: 2.2% exact: 0.0% [4000Hz cycles], (all, 32 CPUs)
81.39% postgres [.] s_lock
5.42% postgres [.] LWLockAcquire
4.59% postgres [.] LWLockRelease
3.06% postgres [.] TransactionIdIsInProgress
0.38% postgres [.] PinBuffer
0.31% postgres [.] TransactionIdPrecedes
0.27% postgres [.] UnpinBuffer
0.19% postgres [.] TransactionIdIsCurrentTransactionId
0.16% postgres [.] heap_hot_search_buffer
0.15% [kernel] [k] number.isra.1
0.14% [kernel] [k] kallsyms_expand_symbol.constprop.1
0.10% [kernel] [k] module_get_kallsym
0.10% libc-2.17.so [.] __strcmp_sse42
0.09% [kernel] [k] _raw_spin_lock
0.09% postgres [.] hash_search_with_hash_value
Is this a spin lock problem? I need everyone's help to solve it. Thanks!
------------------ Original Message ------------------
From: "Merlin Moncure" <mmoncure@gmail.com>
Sent: Wednesday, 28 October 2015, 2:35 AM
To: "Bill Moran" <wmoran@potentialtech.com>
Cc: "莎士比亚说:" <657985552@qq.com>; "pgsql-general" <pgsql-general@postgresql.org>
Subject: Re: [GENERAL] Re: postgres cpu 100% need help
2015-11-08 14:22 GMT+01:00 莎士比亚说: <657985552@qq.com>:
yes, it is spin lock problem. you can try to limit number of active
connections -
http://www.enterprisedb.com/postgres-plus-edb-blog/amit-kapila/read-scalability-postgresql-95.
You can try 9.4, where some issues is fixed
When you hit this issue, the basic prerequisite is test case. If you can
run the simulation with this issue, then we can identify bottleneck.
You can also try to check lwlocks via systemtap:
http://postgres.cz/wiki/Monitorov%C3%A1n%C3%AD_lwlocku_pomoc%C3%AD_systemtapu
- the article is in Czech, but the page works with Google Translate.
Regards
Pavel
On Sun, Nov 8, 2015 at 7:22 AM, 莎士比亚说: <657985552@qq.com> wrote:
Hi Moran and others,
Yesterday I ran into the PG problem again; perf top shows 81.39% of
cycles in s_lock (full output quoted above). Is this a spin lock
problem? I need everyone's help to solve the problem. Thanks!
Yup, spinlock problem. These can be difficult. What we need now is
some metrics. They must be captured during the load event:
*) number of active queries with average duration:
SELECT query, count(*), avg(now() - query_start) FROM
pg_stat_activity where state != 'idle' GROUP BY 1;
*) context switches via "vmstat 1": get a snapshot during load and
during non-load time for comparison
Solution to this problem will probably be one or more of:
1) Significantly downgrade shared_buffers (say, to 4GB)
2) Upgrade database to 9.4 and hope for the best
3) Capture backtrace during load event to determine exactly which path
is going into spinlock
4) Install pgbouncer or another connection pooler to limit active
queries on database
5) Install experimental patches to custom compiled database to test
and verify a hypothetical fix
Out of those 5 things, which are possible for you to do? Best case
scenario is that you have non-essential server that reproduces the
issue.
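For option 4, a minimal PgBouncer sketch might look like this (hypothetical values throughout; `mydb`, the pool size, and the auth file path are placeholders that would need tuning for this workload):

```ini
; Hypothetical pgbouncer.ini fragment: cap concurrently executing
; server connections at 16 and queue the rest, so ~100 client
; connections no longer fight over spinlocks at once.
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 16
max_client_conn = 500
```

Clients would then connect to port 6432 instead of 5432; transaction pooling is the aggressive choice and assumes the application does not rely on session state.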
merlin
On Tue, Nov 10, 2015 at 8:26 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Sun, Nov 8, 2015 at 7:22 AM, 莎士比亚说: <657985552@qq.com> wrote:
[perf top output showing 81.39% of cycles in s_lock, quoted in full above, snipped]
Yup, spinlock problem. [request for metrics and the five suggested remedies, quoted in full above, snipped]
Can we also see output of:
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
merlin
Hi Merlin:
I'm sorry to reply so late. I don't have the exact result:
number of active queries with average duration:
SELECT query, count(*), avg(now() - query_start) FROM
pg_stat_activity where state != 'idle' GROUP BY 1;
There are on average 100 active connections during the load event, and every query costs 2000+ ms.
How can I capture a backtrace during the load event?
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never
657985552@qq.com
From: Merlin Moncure
Date: 2015-11-10 23:55
To: 莎士比亚说:
CC: Bill Moran; pgsql-general
Subject: Re: [GENERAL] 回复: postgres cpu 100% need help
[Merlin's message of 2015-11-10, quoted in full above, snipped]
On Mon, Nov 23, 2015 at 12:20 AM, 657985552@qq.com <657985552@qq.com> wrote:
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never
There's your problem. You need to set those to "never". You can
get immediate relief by echoing 'never' to those pseudo-files, but
you need to configure your system to set them at OS boot time, too.
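One way to make that persist across reboots is a boot-script fragment along these lines (a sketch; the rc.local path is RHEL-family style, the file must be executable there, and other distros or systemd setups would do this differently):

```shell
# Append to /etc/rc.d/rc.local (or an equivalent boot script),
# run as root: disable THP and THP defrag at every boot.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```

After the next boot, `cat` on those two pseudo-files should show `[never]` bracketed instead of `[always]`.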
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Nov 23, 2015 at 9:30 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
On Mon, Nov 23, 2015 at 12:20 AM, 657985552@qq.com <657985552@qq.com> wrote:
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
[root@pg1 ~]# cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never
There's your problem. You need to set those to "never". You can
get immediate relief by echoing 'never' to those pseudo-files, but
you need to configure your system to set them at OS boot time, too.
yup. IMNSHO automatic enabling of THP defrag is one of the worst
performance related mistakes I've ever seen in the history of the
linux kernel (although maybe the blame lies with the distros...I'm not
sure). It caused a number of terrible outages here before we figured
out the underlying cause. (of course, everyone blamed postgres until
then)
merlin