100% CPU usage on some postmaster processes kills the whole database
Hi List,
we are currently running a rather large PostgreSQL installation with approximately 4k transactions and 50k index scans per second.
In the last few days, at irregular times (3-4 times a day), some of the postmaster processes have been running at 100% CPU usage. That leads to a total breakdown of query execution. We see tons of statements that are correctly aborted automatically by our statement_timeout of 15 seconds. I tried to search, but cannot really work out what the problem could be...
Some things I have checked:
- We are not running any bulk jobs or maintenance scripts at those times
- No system errors in any logs during those slowdowns
- I/O performance seems fine, with no high I/O wait. But I/O writes drop completely at those times, because it seems no postgres process can perform any update
I just installed a script that captures the top and ps axf information to help track down the problem. I will post a snippet of the top output here:
top - 15:55:02 up 59 days, 37 min, 1 user, load average: 35.95, 14.04, 7.32
Tasks: 2417 total, 54 running, 2363 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.3%us, 1.0%sy, 0.0%ni, 90.2%id, 1.9%wa, 0.0%hi, 0.6%si, 0.0%st
Mem: 264523700k total, 250145228k used, 14378472k free, 207032k buffers
Swap: 2097144k total, 553624k used, 1543520k free, 166905748k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29852 postgres 20 0 131g 59m 35m R 100.0 0.0 1:27.71 postmaster
29854 postgres 20 0 131g 70m 45m R 100.0 0.0 1:35.43 postmaster
17449 postgres 20 0 131g 1.2g 1.2g R 100.0 0.5 1:52.62 postmaster
29868 postgres 20 0 131g 1.1g 1.0g R 100.0 0.4 1:58.93 postmaster
30136 postgres 20 0 131g 77m 52m R 100.0 0.0 1:34.33 postmaster
30294 postgres 20 0 131g 66m 41m R 100.0 0.0 1:33.33 postmaster
30864 postgres 20 0 131g 66m 41m R 100.0 0.0 1:36.17 postmaster
30872 postgres 20 0 131g 61m 36m R 100.0 0.0 1:26.81 postmaster
30876 postgres 20 0 131g 68m 43m R 100.0 0.0 1:33.97 postmaster
30899 postgres 20 0 131g 68m 44m R 100.0 0.0 1:38.95 postmaster
30906 postgres 20 0 131g 67m 42m R 100.0 0.0 1:27.82 postmaster
31173 postgres 20 0 131g 68m 44m R 100.0 0.0 1:28.49 postmaster
31239 postgres 20 0 131g 71m 46m R 100.0 0.0 1:31.42 postmaster
31248 postgres 20 0 131g 90m 65m R 100.0 0.0 1:26.20 postmaster
34934 postgres 20 0 131g 5580 3456 R 100.0 0.0 1:23.96 postmaster
47945 postgres 20 0 131g 3.0g 3.0g R 100.0 1.2 6:08.41 postmaster
16116 postgres 20 0 131g 84m 59m R 100.0 0.0 1:30.60 postmaster
16304 postgres 20 0 131g 85m 60m R 100.0 0.0 1:38.89 postmaster
17104 postgres 20 0 131g 96m 72m R 100.0 0.0 1:27.54 postmaster
17111 postgres 20 0 131g 98m 73m R 100.0 0.0 1:38.23 postmaster
17320 postgres 20 0 131g 98m 74m R 100.0 0.0 1:38.51 postmaster
31221 postgres 20 0 131g 63m 38m R 100.0 0.0 1:33.63 postmaster
31272 postgres 20 0 131g 1.0g 1.0g R 100.0 0.4 1:32.71 postmaster
3290 postgres 20 0 131g 99m 74m R 100.0 0.0 1:32.76 postmaster
3459 postgres 20 0 131g 2.1g 2.0g R 100.0 0.8 1:44.92 postmaster
16492 postgres 20 0 131g 100m 75m R 100.0 0.0 1:33.36 postmaster
16562 postgres 20 0 131g 114m 89m R 100.0 0.0 1:35.14 postmaster
17146 postgres 20 0 131g 91m 66m R 100.0 0.0 1:37.39 postmaster
17403 postgres 20 0 131g 98m 73m R 100.0 0.0 1:32.13 postmaster
31100 postgres 20 0 131g 62m 38m R 100.0 0.0 1:29.06 postmaster
2019 postgres 20 0 131g 1.2g 1.2g R 98.7 0.5 1:40.91 postmaster
2150 postgres 20 0 131g 1.3g 1.3g R 98.7 0.5 2:53.14 postmaster
16048 postgres 20 0 131g 71m 46m R 98.7 0.0 1:29.75 postmaster
30190 postgres 20 0 131g 1.4g 1.3g R 98.7 0.5 0:55.98 postmaster
16112 postgres 20 0 131g 862m 827m R 97.1 0.3 0:48.00 postmaster
31202 postgres 20 0 131g 74m 49m R 97.1 0.0 1:34.62 postmaster
35658 postgres 20 0 131g 5948 3788 R 97.1 0.0 0:12.29 postmaster
16134 postgres 20 0 131g 1.9g 1.9g R 95.4 0.8 1:47.27 postmaster
31034 postgres 20 0 131g 69m 44m R 95.4 0.0 1:26.35 postmaster
16120 postgres 20 0 131g 1.2g 1.2g R 93.8 0.5 2:04.02 postmaster
30891 postgres 20 0 131g 57m 33m R 93.8 0.0 1:23.08 postmaster
31261 postgres 20 0 131g 81m 56m R 93.8 0.0 1:24.51 postmaster
29790 postgres 20 0 131g 62m 37m R 92.2 0.0 1:35.34 postmaster
30426 postgres 20 0 131g 62m 37m R 87.4 0.0 1:34.51 postmaster
30857 postgres 20 0 131g 50m 26m R 79.3 0.0 1:37.82 postmaster
507 root 39 19 0 0 0 R 67.9 0.0 19:19.71 khugepaged
16095 postgres 20 0 131g 83m 58m R 67.9 0.0 1:27.64 postmaster
30856 postgres 20 0 131g 69m 44m R 67.9 0.0 1:34.46 postmaster
17442 postgres 20 0 131g 2.4g 2.4g S 11.3 0.9 1:02.14 postmaster
PostgreSQL version information:
- PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.5 20110214 (Red Hat 4.4.5-6), 64-bit
- Running hot standby replication to another node (same hardware setup there)
Server Hardware:
- 4x 12-core AMD Magny-Cours CPUs
- 256 GB of RAM (36% currently used)
- 1.3 TB SAS RAID (LSI RAID controller), 15k rpm
If I have left out any information important for analyzing my problem, please let me know. I did my best to describe the question as accurately as I can.
--
Kind regards
Paul Dunkler
On 01/03/12 16:41, Paul Dunkler wrote:
I just installed a script that captures the top and ps axf information to help track down the problem.
Combine that with this:
SELECT * FROM pg_stat_activity;
That will let you line up PIDs from top with active queries.
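For example, something along these lines (just a sketch against the 9.1 catalogs, where the backend process ID column is still called procpid and idle connections report '<IDLE>' as their query text):
-- Sketch: match the PIDs from top against the statements they are running
-- (PostgreSQL 9.1 column names).
SELECT procpid,                         -- the OS process ID shown by top
       waiting,                         -- true if blocked on a lock
       now() - query_start AS runtime,  -- how long the statement has been running
       current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'         -- hide idle connections
ORDER BY query_start;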
--
Richard Huxton
Archonet Ltd
I did that now and analyzed the situation a bit. There are only queries running that normally execute very fast, even under high load (only index scans, very low rates of sequential scans). I found a remarkable number of INSERT statements...
And sometimes when that happens, the overall CPU utilization goes up to nearly 100% too, of which 98% is system time...
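A crude way to get such a tally on 9.1 might look like this (grouping by the first word of current_query is only a heuristic, not anything official):
-- Sketch: tally currently running statements by their leading keyword.
SELECT upper(split_part(current_query, ' ', 1)) AS statement_type,
       count(*) AS running
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'
GROUP BY 1
ORDER BY 2 DESC;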
On 01.03.2012 at 18:02, Richard Huxton wrote:
Combine that with this:
SELECT * FROM pg_stat_activity;
That will let you line up PIDs from top with active queries.
--
Richard Huxton
Archonet Ltd
--
Kind regards
Paul Dunkler
-----------------------------------------------------
XYRALITY GmbH • Lerchenstraße 28a • 22767 Hamburg
Paul Dunkler • Software Developer
Mail: paul.dunkler@xyrality.com
Tel: +49 (0) 40 23 51 78 97
Mobile: +49 (0) 151 252 228 42
Fax: +49 (0) 40 23 51 78 98
Web: http://www.xyrality.com/
Commercial register: Hamburg HRB 115332
Managing directors: Sven Ossenbrüggen & Alexander Spohr
-----------------------------------------------------
On 01/03/12 19:41, Paul Dunkler wrote:
I did that now and analyzed the situation a bit. There are only queries running that normally execute very fast, even under high load (only index scans, very low rates of sequential scans). I found a remarkable number of INSERT statements...
And sometimes when that happens, the overall CPU utilization goes up to nearly 100% too, of which 98% is system time...
You're running on a box larger than I'm used to, so this is only
speculation. I'm wondering whether you're hitting problems with lock
contention or some such. It looks like you've got 48 cores there, all at
about 100%, possibly with none of them getting much chance to do any work.
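One way to test that guess directly is to look for ungranted heavyweight locks (a sketch against the 9.1 catalogs; pg_locks.pid joins to pg_stat_activity.procpid there - note that contention on spinlocks or lightweight locks would not show up here):
-- Sketch: backends waiting on locks, with the statement each is trying to run.
SELECT l.pid,
       l.locktype,
       l.mode,
       a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;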
Oddly, the totals you posted in your top output show 6.3% user CPU usage, which I can't square with 50-odd processes all approaching 100% CPU.
Perhaps have a look at vmstat output too - see if context-switches spike
unusually high during these periods (sorry - no idea what an unusually
high number would be on a machine like yours).
Reducing the number of concurrent backends might help, but that rather
depends on whether my guess is right.
If no-one more experienced than me comes along shortly, try reposting to
the performance list. There are people there who are used to machines of
this size.
--
Richard Huxton
Archonet Ltd
Hi,
You're running on a box larger than I'm used to, so this is only speculation. I'm wondering whether you're hitting problems with lock contention or some such. It looks like you've got 48 cores there, all at about 100%, possibly with none of them getting much chance to do any work.
Yes, that is what I see too...
Oddly, the totals you posted in your top output show 6.3% user CPU usage, which I can't square with 50-odd processes all approaching 100% CPU.
Sometimes the CPU is only 7% used during these incidents, but at other peak times the CPU is 100% used (97% system time), as I posted before.
Perhaps have a look at vmstat output too - see if context-switches spike unusually high during these periods (sorry - no idea what an unusually high number would be on a machine like yours).
Thanks, I will have a look at it.
Reducing the number of concurrent backends might help, but that rather depends on whether my guess is right.
Yes... I have already thought about setting up a connection pool.
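To get a rough feel for how many backends are actually working versus sitting idle (useful when sizing a pool), something like this should work on 9.1 - the three-way split below is just one possible breakdown:
-- Sketch: count backends by rough state; on 9.1 idle connections report
-- '<IDLE>' (or '<IDLE> in transaction') as their current_query.
SELECT CASE WHEN current_query LIKE '<IDLE>%' THEN 'idle'
            WHEN waiting                      THEN 'waiting on lock'
            ELSE 'active' END AS state,
       count(*) AS backends
FROM pg_stat_activity
GROUP BY 1;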
If no-one more experienced than me comes along shortly, try reposting to the performance list. There are people there who are used to machines of this size.
Thanks. I will wait a while and consider re-posting it to the performance list.
--
Richard Huxton
Archonet Ltd
--
Kind regards
Paul Dunkler
On Thu, Mar 1, 2012 at 1:52 PM, Paul Dunkler <paul.dunkler@xyrality.com> wrote:
Perhaps have a look at vmstat output too - see if context-switches spike unusually high during these periods (sorry - no idea what an unusually high number would be on a machine like yours).
Thanks, I will have a look at it.
I'd look at vmstat and iostat output (vmstat 10, iostat -xd 10) for a few minutes. In vmstat, look for high (>100k) numbers in the in (interrupts) or cs (context switches) columns; in iostat, look at the I/O utilization.