BUG #6650: CPU system time utilization rising few times a day
The following bug has been logged on the website:
Bug reference: 6650
Logged by: Andrzej Krawiec
Email address: a.krawiec@focustelecom.pl
PostgreSQL version: 8.4.11
Operating system: CentOS 6.0 - 2.6.32-220.13.1.el6.x86_64
Description:
The problem was first observed on PG 8.4.9 (same OS) and occurs there as
well. A few times a day I get a situation where PostgreSQL stops running for
1-2 minutes. The CPU is running at 99% system time. I/O is OK; only
interrupts are extremely high (over 100k). The system runs on 2 x 10-core
Xeon, 128 GB RAM, RAID 10. Does anyone have any idea?
On Fri, May 18, 2012 at 5:09 AM, <a.krawiec@focustelecom.pl> wrote:
Try using strace to figure out where all that system time is going.
Sometimes the '-c' option is helpful.
It might also be helpful to connect gdb to the process and get a
backtrace, then continue, stop it again, get another backtrace.
Repeat that a few times and send us the backtrace that occurs most
frequently.
Is it a regular backend that is eating all that CPU time, or an
autovacuum worker?
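As a minimal sketch, the two diagnostics suggested above could be wrapped in one helper. The PID argument, 30-second sample window, and five backtraces are illustrative assumptions, not values from the thread; attaching requires root or the same user as the postgres process.

```shell
# Sketch of the sampling approach described above. The PID, the
# 30-second window, and the 5 backtrace iterations are assumptions
# for illustration only.
sample_backend() {
    pid="$1"

    # Summarize syscall counts and time (-c) instead of logging each call.
    strace -c -p "$pid" &
    tracer=$!
    sleep 30
    kill -INT "$tracer"    # SIGINT makes strace print its summary table

    # Grab several backtraces; the one that recurs most often shows
    # where the time is going.
    for i in 1 2 3 4 5; do
        gdb -p "$pid" -batch -ex bt -ex detach
        sleep 2
    done
}
# Example (hypothetical PID): sample_backend 12345
```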
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
We cannot strace or gdb on a production system under heavy load (about 100
transactions per second).
It's in kernel space, not user space, so we are unable to do anything at that
particular moment (sometimes even the ssh connection seems to hang for a
while).
We suspect neither autovacuum (although we suspected it at first) nor a
regular backend. It is system time. The question is: what's the reason for
it?
We've dug through the system and postgres logs, cleared out most of the
long-query and idle-in-transaction problems, optimized queries, vacuumed,
reindexed, and so on.
For a while it seemed like the particular kernel version was causing the
majority of the problems. We downgraded to 2.6.32-71.29.1.el6.x86_64 and
those problems mostly went away. For a few days we had no incidents, but then
it happened again.
Regards
--
Andrzej Krawiec
2012/5/22 Robert Haas <robertmhaas@gmail.com>
On Wed, May 23, 2012 at 5:29 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:
perf can tell you about problems in kernel-space, but I'm not sure it
exists that far back.
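A minimal system-wide perf profile might look like the sketch below. The 60-second window and output path are assumed values; it needs root and a perf package matching the running kernel.

```shell
# Hypothetical sketch of kernel-side profiling with perf; the
# 60-second window and /tmp output path are assumptions. Kernel
# frames show up even when user-space symbols are missing.
profile_system() {
    # -a: sample all CPUs, -g: capture call graphs.
    perf record -a -g -o /tmp/perf.data sleep 60

    # Summarize: the hottest kernel symbols appear at the top.
    perf report -i /tmp/perf.data --stdio | head -50
}
# Example: run profile_system while the system-time spike is happening.
```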
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Ok, we've managed to run strace -s during such a situation (see the
attached file). I have no clue what it can mean. Only the error count is
quite strange.
Could this http://postgresql.1045698.n5.nabble.com/high-CPU-usage-for-stats-collector-in-8-2-td1962590.html
affect our environment?
--
Andrzej Krawiec
On Thu, May 31, 2012 at 3:29 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:
Ok, we've managed to run strace -s during such a situation (see the
attached file). I have no clue what it can mean. Only the error count is
quite strange.
How long was strace -s run for to generate this?
Could this http://postgresql.1045698.n5.nabble.com/high-CPU-usage-for-stats-collector-in-8-2-td1962590.html
affect our environment?
Not sure, but that's a much older version of PostgreSQL than the one
you're running, and I think there may have been some improvements in the
meantime.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 1, 2012 at 2:17 AM, Andrzej Krawiec
<a.krawiec@focustelecom.pl> wrote:
2012/5/31 Robert Haas <robertmhaas@gmail.com>:
How long was strace -s run for to generate this?
strace -s was running for about 2 minutes.
Hmm, I'm sort of confused then. This only shows a total of 1.815816
seconds of system time, which is a lot less than 99% utilization * 2
minutes.
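The mismatch pointed out here can be checked with one line of arithmetic (the duration, utilization, and observed figure are the numbers from the thread):

```shell
# Expected system time if the traced process really burned 99% CPU in
# syscalls for the whole 2-minute trace, vs. what strace reported.
duration=120        # seconds traced (from the thread)
utilization=99      # percent system time (from the thread)
observed=1.815816   # seconds strace attributed to syscalls

expected=$(awk -v d="$duration" -v u="$utilization" \
    'BEGIN { printf "%.1f", d * u / 100 }')
echo "expected ~${expected}s of system time, strace saw ${observed}s"
```

The roughly two-orders-of-magnitude gap suggests the time is being spent in the kernel outside the traced process's own syscalls, where strace cannot see it.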
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
It has happened again, twice today. We have one perf dump, but it is over
600 MB of bzip2-ed data (3.5 GB of plain binary perf data).
We'll dig deeper into it, and if we don't come to any conclusions we will
upload it to our anonymous FTP.
I don't know if this perf dump is OK, as it was again triggered by our script
upon a rise in system load. Maybe it is too late...
Another question: isn't the error count from my strace log high?
--
Andrzej Krawiec
2012/6/8 Robert Haas <robertmhaas@gmail.com>
Ok, we are 99% sure of the reason for this situation. Let's say we have
1 GB of free memory. Some process (especially postgres, as this is a
dedicated server) wants to allocate a little more. The kernel has to scan
the page cache to determine which cache pages can be dropped and handed out
as user memory. If the server is idle (almost no queries), allocating huge
amounts of memory is almost invisible, but under production load (about
100-200 transactions per second) the system halts for a while to drop some
of the cache pages.
We are considering setting the kernel value extra_free_kbytes to some
reasonably high value to check whether it helps keep free memory at a
certain level.
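As a sketch, the tuning being considered might look like the fragment below. Note that extra_free_kbytes exists only in kernels carrying a non-mainline patch; min_free_kbytes is the closest mainline knob. The 1 GB figure is an illustrative assumption, not a value from the thread, and sizing must be tested against the real workload.

```shell
# Illustrative values only. Keep roughly 1 GB free so allocation
# bursts don't force synchronous page reclaim:
sysctl -w vm.min_free_kbytes=1048576

# On kernels that carry the extra_free_kbytes patch, keep extra memory
# free for bursts without raising the hard minimum watermark:
echo 1048576 > /proc/sys/vm/extra_free_kbytes   # only if the file exists
```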
It seems that it's not a PostgreSQL bug.
Can anyone try to confirm this?
--
Regards
Andrzej Krawiec