BUG #6650: CPU system time utilization rising few times a day

Started by Andrzej Krawiecalmost 14 years ago9 messagesbugs
Jump to latest
#1Andrzej Krawiec
a.krawiec@focustelecom.pl

The following bug has been logged on the website:

Bug reference: 6650
Logged by: Andrzej Krawiec
Email address: a.krawiec@focustelecom.pl
PostgreSQL version: 8.4.11
Operating system: CentOS 6.0 - 2.6.32-220.13.1.el6.x86_64
Description:

Primarily checked on PG 8.4.9 (same OS), problem also occurs. Few times a
day I get a situation where PostgreSQL stops running for 1-2 minutes. CPU is
running 99% in systime. IO is OK, only interrupts are extremely high (over
100k). System operates on 2 x Xeon 10 Core, 128 GB RAM, raid 10. Does anyone
have any idea?

#2Robert Haas
robertmhaas@gmail.com
In reply to: Andrzej Krawiec (#1)
Re: BUG #6650: CPU system time utilization rising few times a day

On Fri, May 18, 2012 at 5:09 AM, <a.krawiec@focustelecom.pl> wrote:

The following bug has been logged on the website:

Bug reference:      6650
Logged by:          Andrzej Krawiec
Email address:      a.krawiec@focustelecom.pl
PostgreSQL version: 8.4.11
Operating system:   CentOS 6.0 - 2.6.32-220.13.1.el6.x86_64
Description:

Primarily checked on PG 8.4.9 (same OS), problem also occurs. Few times a
day I get a situation where PostgreSQL stops running for 1-2 minutes. CPU is
running 99% in systime. IO is OK, only interrupts are extremely high (over
100k). System operates on 2 x Xeon 10 Core, 128 GB RAM, raid 10. Does anyone
have any idea?

Try using strace to figure out where all that system time is going.
Sometimes the '-c' option is helpful.

It might also be helpful to connect gdb to the process and get a
backtrace, then continue, stop it again, get another backtrace.
Repeat that a few times and send us the backtrace that occurs most
frequently.

Is it a regular backend that is eating all that CPU time, or an
autovacuum worker?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Andrzej Krawiec
a.krawiec@focustelecom.pl
In reply to: Robert Haas (#2)
Re: BUG #6650: CPU system time utilization rising few times a day

Cannot strace or gdb on a production system under heavy load (about 100
transactions per second).
It's in kernel space not user, so we are unable to anything at this
particular moment (sometimes even the ssh connection seems to hang for a
while).
We suspect neither autovacuum (although suspected primarily) nor regular
backend. It is system time. The question is: what's the reasone for that?
We've dug through system and postgres logs, cleared out most of the long
query problems, idle in transaction, optimized queries, vacuumed, reindexed
and such.
For a while it seemed like the particular kernel version is causing
majority of problems. We have downgraded to 2.6.32.-71.29.1.el6.x86_64 and
those problems went mostly! away. For few days we had no situations, but it
happened again.

Regards
--
Andrzej Krawiec

2012/5/22 Robert Haas <robertmhaas@gmail.com>

Show quoted text

On Fri, May 18, 2012 at 5:09 AM, <a.krawiec@focustelecom.pl> wrote:

The following bug has been logged on the website:

Bug reference: 6650
Logged by: Andrzej Krawiec
Email address: a.krawiec@focustelecom.pl
PostgreSQL version: 8.4.11
Operating system: CentOS 6.0 - 2.6.32-220.13.1.el6.x86_64
Description:

Primarily checked on PG 8.4.9 (same OS), problem also occurs. Few times a
day I get a situation where PostgreSQL stops running for 1-2 minutes.

CPU is

running 99% in systime. IO is OK, only interrupts are extremely high

(over

100k). System operates on 2 x Xeon 10 Core, 128 GB RAM, raid 10. Does

anyone

have any idea?

Try using strace to figure out where all that system time is going.
Sometimes the '-c' option is helpful.

It might also be helpful to connect gdb to the process and get a
backtrace, then continue, stop it again, get another backtrace.
Repeat that a few times and send us the backtrace that occurs most
frequently.

Is it a regular backend that is eating all that CPU time, or an
autovacuum worker?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Robert Haas
robertmhaas@gmail.com
In reply to: Andrzej Krawiec (#3)
Re: BUG #6650: CPU system time utilization rising few times a day

On Wed, May 23, 2012 at 5:29 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:

Cannot strace or gdb on a production system under heavy load (about 100
transactions per second).
It's in kernel space not user, so we are unable to anything at this
particular moment (sometimes even the ssh connection seems to hang for a
while).
We suspect neither autovacuum (although suspected primarily) nor regular
backend. It is system time. The question is: what's the reasone for that?
We've dug through system and postgres logs, cleared out most of the long
query problems, idle in transaction, optimized queries, vacuumed, reindexed
and such.
For a while it seemed like the particular kernel version is causing majority
of problems. We have downgraded to 2.6.32.-71.29.1.el6.x86_64 and those
problems went mostly! away. For few days we had no situations, but it
happened again.

perf can tell you about problems in kernel-space, but I'm not sure it
exists that far back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Andrzej Krawiec
a.krawiec@focustelecom.pl
In reply to: Robert Haas (#4)
Re: BUG #6650: CPU system time utilization rising few times a day

Ok, we've managed to do strace -s during such a situation (see
attached file). I have no clue what can it mean. Only errors count is
quite strange.
Could this http://postgresql.1045698.n5.nabble.com/high-CPU-usage-for-stats-collector-in-8-2-td1962590.html
affect our environment?

--
Andrzej Krawiec

Show quoted text

perf can tell you about problems in kernel-space, but I'm not sure it
exists that far back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

strace_2012-05-31_08_38_44.logapplication/octet-stream; name=strace_2012-05-31_08_38_44.logDownload
#6Robert Haas
robertmhaas@gmail.com
In reply to: Andrzej Krawiec (#5)
Re: BUG #6650: CPU system time utilization rising few times a day

On Thu, May 31, 2012 at 3:29 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:

Ok, we've managed to do strace -s during such a situation (see
attached file). I have no clue what can it mean. Only errors count is
quite strange.

How long was strace -s run for to generate this?

Could this http://postgresql.1045698.n5.nabble.com/high-CPU-usage-for-stats-collector-in-8-2-td1962590.html
affect our environment?

Not sure, but that's a much older version of PostgreSQL than the one
you're running, and I think there may have been some improvements
meanwhile.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Robert Haas
robertmhaas@gmail.com
In reply to: Andrzej Krawiec (#1)
Re: BUG #6650: CPU system time utilization rising few times a day

On Fri, Jun 1, 2012 at 2:17 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:

2012/5/31 Robert Haas <robertmhaas@gmail.com>:

How long was strace -s run for to generate this?

Strace - s was running for about 2 minutes.

Hmm, I'm sort of confused then. This only shows a total of 1.815816
seconds of system time, which is a lot less than 99% utilization * 2
minutes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Andrzej Krawiec
a.krawiec@focustelecom.pl
In reply to: Robert Haas (#7)
Re: BUG #6650: CPU system time utilization rising few times a day

It has happened again. Twice today. We have one perf dump. But it is over
600 MB of bzip2-ed data (3.5 GB of plain binary perf data).
We'll go deeper into that, and if we don't come to any conclusions we will
upload it to our anonymous FTP.

Don't know if this perf dump is ok, as it was again triggered by our sciprt
upon system load rise. Maybe it is too late...

Another question: is errors count from my strace log not high?

--
Andrzej Krawiec

2012/6/8 Robert Haas <robertmhaas@gmail.com>

Show quoted text

On Fri, Jun 1, 2012 at 2:17 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:

2012/5/31 Robert Haas <robertmhaas@gmail.com>:

How long was strace -s run for to generate this?

Strace - s was running for about 2 minutes.

Hmm, I'm sort of confused then. This only shows a total of 1.815816
seconds of system time, which is a lot less than 99% utilization * 2
minutes.

#9Andrzej Krawiec
a.krawiec@focustelecom.pl
In reply to: Andrzej Krawiec (#8)
Re: BUG #6650: CPU system time utilization rising few times a day

Ok, we are 99% sure as for what is the reason for this situation. Let's say
we have a 1 GB of free memory. Any process (especially postgres, as it is
dedicated server) wants to allocate little more. Kernel has to look through
cached memory to determine which cache pages can be dropped and allocated
as user memory. If the server is idle (almost no queries) allocating huge
amounts of memory is almost invisible, but under production load (about
100-200 transactions per second) system halts for a while, to drop some of
cache pages.

We consider setting kernel value extra_free_kbytes to some reasonably high
value to check if it helps keeping free memory at certain level.

It seems that it's not a PostgreSQL bug.

Can anyone try to confirm this?

--
Regards
Andrzej Krawiec