Server hitting 100% CPU usage, system comes to a crawl.

Started by Brian Fehrleover 14 years ago16 messagesgeneral

brianf@consistentstate.com

over 14 years ago

Hi all, need some help/clues on tracking down a performance issue.

PostgreSQL version: 8.3.11

I've got a system that has 32 cores and 128 gigs of ram. We have
connection pooling set up, with about 100 - 200 persistent connections
open to the database. Our applications then use these connections to
query the database constantly, but when a connection isn't currently
executing a query, it's <IDLE>. On average, at any given time, there are
3 - 6 connections that are actually executing a query, while the rest
are <IDLE>.

About once a day, queries that normally take just a few seconds slow way
down, and start to pile up, to the point where instead of just having
3-6 queries running at any given time, we get 100 - 200. The whole
system comes to a crawl, and looking at top, the CPU usage is 99%.

Looking at top, I see no SWAP usage, very little IOWait, and there are a
large number of postmaster processes at 100% cpu usage (makes sense, at
this point there are 150 or so queries currently executing on the database).

Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si,
0.2%st
Mem: 134217728k total, 131229972k used, 2987756k free, 462444k buffers
Swap: 8388600k total, 296k used, 8388304k free, 119029580k cached

In the past, we noticed that autovacuum was hitting some large tables at
the same time this happened, so we turned autovacuum off to see if that
was the issue, and it still happened without any vacuums running.

We also ruled out checkpoints being the cause.

I'm currently digging through some statistics I've been gathering to see
if traffic increased at all, or remained the same when the slowdown
occurred. I'm also digging through the logs from the postgresql cluster
(I increased verbosity yesterday), looking for any clues. Any
suggestions or clues on where to look for this to see what can be
causing a slowdown like this would be greatly appreciated.

Thanks,
- Brian F

John R Pierce

pierce@hogranch.com

over 14 years ago

In reply to: Brian Fehrle (#1)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On 10/27/11 11:39 AM, Brian Fehrle wrote:

I've got a system that has 32 cores and 128 gigs of ram. We have
connection pooling set up, with about 100 - 200 persistent connections
open to the database. Our applications then use these connections to
query the database constantly, but when a connection isn't currently
executing a query, it's <IDLE>. On average, at any given time, there
are 3 - 6 connections that are actually executing a query, while the
rest are <IDLE>.

thats not a very effective use of pooling. the pooling model, you'd
have a connection pool sufficient actual database connections to satisfy
your concurrency requirements, and your apps would grab a connection
from the pool, do a transaction, then release the connection back to the
pool.

now, I don't know that this has anything to do with your performance
problem, I'm just pointing out this anomaly. a pool doesn't do much
good if the clients grab a connection and just sit on it.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

Scott Marlowe

scott.marlowe@gmail.com

over 14 years ago

In reply to: Brian Fehrle (#1)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On Thu, Oct 27, 2011 at 12:39 PM, Brian Fehrle
<brianf@consistentstate.com> wrote:

Looking at top, I see no SWAP usage, very little IOWait, and there are a
large number of postmaster processes at 100% cpu usage (makes sense, at this
point there are 150 or so queries currently executing on the database).

Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si,
0.2%st
Mem: 134217728k total, 131229972k used, 2987756k free, 462444k buffers
Swap: 8388600k total, 296k used, 8388304k free, 119029580k cached

OK, a few points. 1: You've got a zombie process. Find out what's
causing that, it could be a trigger of some type for this behaviour.
2: You're 92% sys. That's bad. It means the OS is chewing up 92% of
your 32 cores doing something. what tasks are at the top of the list
in top?

Try running vmstat 10 for a a minute or so then look at cs and int
columns. If cs or int is well over 100k there could be an issue with
thrashing, where your app is making some change to the db that
requires all backends to be awoken at once and the machine just falls
over under the load.

Scott Marlowe

scott.marlowe@gmail.com

over 14 years ago

In reply to: Scott Marlowe (#3)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On Thu, Oct 27, 2011 at 1:48 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

OK, a few points. 1: You've got a zombie process. Find out what's

To expand on the zombie thing, it's quite possible that you're
managing to make a pg backend process crashout, which would cause the
db to restart midday, which is bad (TM) since that dumps all of shared
buffers and forces all clients to reconnect. So look through the
system logs for segmentation faults, etc.

Scott Marlowe

scott.marlowe@gmail.com

over 14 years ago

In reply to: Scott Marlowe (#4)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On Thu, Oct 27, 2011 at 1:52 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

On Thu, Oct 27, 2011 at 1:48 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

OK, a few points. 1: You've got a zombie process. Find out what's

To expand on the zombie thing, it's quite possible that you're
managing to make a pg backend process crashout, which would cause the
db to restart midday, which is bad (TM) since that dumps all of shared
buffers and forces all clients to reconnect. So look through the
system logs for segmentation faults, etc.

One last thing, you should upgrade to the latest 8.3 version to see if
that helps. There was a bug fix around 8.3.13 or so that stopped
postgresql from restarting due to a simple data corruption issue that
should have only resulted in an error message not a restart of the db.
I know, cause I found it. :) Thanks to the pg devs for fixing it.

Brian Fehrle

brianf@consistentstate.com

over 14 years ago

In reply to: Scott Marlowe (#3)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On 10/27/2011 01:48 PM, Scott Marlowe wrote:

On Thu, Oct 27, 2011 at 12:39 PM, Brian Fehrle
<brianf@consistentstate.com> wrote:

Looking at top, I see no SWAP usage, very little IOWait, and there are a
large number of postmaster processes at 100% cpu usage (makes sense, at this
point there are 150 or so queries currently executing on the database).

Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si,
0.2%st
Mem: 134217728k total, 131229972k used, 2987756k free, 462444k buffers
Swap: 8388600k total, 296k used, 8388304k free, 119029580k cached

OK, a few points. 1: You've got a zombie process. Find out what's
causing that, it could be a trigger of some type for this behaviour.
2: You're 92% sys. That's bad. It means the OS is chewing up 92% of
your 32 cores doing something. what tasks are at the top of the list
in top?

Out of the top 50 processes in top, 48 of them are postmasters, one is
syslog, and one is psql. Each of the postmasters have a high %CPU, the
top ones being 80% and higher, the rest being anywhere between 30% -
60%. Would postmaster 'queries' that are running attribute to the sys
CPU usage, or should they be under the 'us' CPU usage?

Try running vmstat 10 for a a minute or so then look at cs and int
columns. If cs or int is well over 100k there could be an issue with
thrashing, where your app is making some change to the db that
requires all backends to be awoken at once and the machine just falls
over under the load.

We've restarted the postgresql cluster, so the issue is not happening at
this moment. but running a vmstat 10 had my 'cs' average at 3K and 'in'
averaging around 9.5K.

- Brian F

Brian Fehrle

brianf@consistentstate.com

over 14 years ago

In reply to: Brian Fehrle (#6)

Re: Server hitting 100% CPU usage, system comes to a crawl.

Also, I'm not having any issue with the database restarting itself,
simply becoming unresponsive / slow to respond, to the point where just
sshing to the box takes about 30 seconds if not longer. Performing a
pg_ctl restart on the cluster resolves the issue.

I looked through the logs for any segmentation faults, none found. In
fact the only thing in my log that seems to be 'bad' are the following.

Oct 27 08:53:18 <snip> postgres[17517]: [28932839-1]
user=<snip>,db=<snip> ERROR: deadlock detected
Oct 27 11:49:22 <snip> postgres[608]: [19-1] user=<snip>,db=<snip>
ERROR: could not serialize access due to concurrent update

I don't believe these occurred too close to the slowdown.

- Brian F

Show quoted text

On 10/27/2011 02:09 PM, Brian Fehrle wrote:

On 10/27/2011 01:48 PM, Scott Marlowe wrote:

On Thu, Oct 27, 2011 at 12:39 PM, Brian Fehrle
<brianf@consistentstate.com> wrote:

Looking at top, I see no SWAP usage, very little IOWait, and there
are a
large number of postmaster processes at 100% cpu usage (makes sense,
at this
point there are 150 or so queries currently executing on the database).

Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si,
0.2%st
Mem: 134217728k total, 131229972k used, 2987756k free, 462444k
buffers
Swap: 8388600k total, 296k used, 8388304k free, 119029580k
cached

OK, a few points. 1: You've got a zombie process. Find out what's
causing that, it could be a trigger of some type for this behaviour.
2: You're 92% sys. That's bad. It means the OS is chewing up 92% of
your 32 cores doing something. what tasks are at the top of the list
in top?

Out of the top 50 processes in top, 48 of them are postmasters, one is
syslog, and one is psql. Each of the postmasters have a high %CPU, the
top ones being 80% and higher, the rest being anywhere between 30% -
60%. Would postmaster 'queries' that are running attribute to the sys
CPU usage, or should they be under the 'us' CPU usage?

Try running vmstat 10 for a a minute or so then look at cs and int
columns. If cs or int is well over 100k there could be an issue with
thrashing, where your app is making some change to the db that
requires all backends to be awoken at once and the machine just falls
over under the load.

We've restarted the postgresql cluster, so the issue is not happening
at this moment. but running a vmstat 10 had my 'cs' average at 3K and
'in' averaging around 9.5K.

- Brian F

Scott Mead

scottm@openscg.com

over 14 years ago

In reply to: Brian Fehrle (#1)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On Thu, Oct 27, 2011 at 2:39 PM, Brian Fehrle <brianf@consistentstate.com>wrote:

Hi all, need some help/clues on tracking down a performance issue.

PostgreSQL version: 8.3.11

I've got a system that has 32 cores and 128 gigs of ram. We have connection
pooling set up, with about 100 - 200 persistent connections open to the
database. Our applications then use these connections to query the database
constantly, but when a connection isn't currently executing a query, it's
<IDLE>. On average, at any given time, there are 3 - 6 connections that are
actually executing a query, while the rest are <IDLE>.

Remember, when you read pg_stat_activity, it is showing you query activity
from that exact specific moment in time. Just because it looks like only
3-6 connections are executing, doesn't mean that 200 aren't actually
executing < .1ms statements. With such a beefy box, I would see if you can
examine any stats from your connection pooler to find out how many
connections are actually getting used.

About once a day, queries that normally take just a few seconds slow way
down, and start to pile up, to the point where instead of just having 3-6
queries running at any given time, we get 100 - 200. The whole system comes
to a crawl, and looking at top, the CPU usage is 99%.

Looking at top, I see no SWAP usage, very little IOWait, and there are a
large number of postmaster processes at 100% cpu usage (makes sense, at this
point there are 150 or so queries currently executing on the database).

Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si,
0.2%st
Mem: 134217728k total, 131229972k used, 2987756k free, 462444k buffers
Swap: 8388600k total, 296k used, 8388304k free, 119029580k cached

In the past, we noticed that autovacuum was hitting some large tables at
the same time this happened, so we turned autovacuum off to see if that was
the issue, and it still happened without any vacuums running.

That was my next question :)

We also ruled out checkpoints being the cause.

How exactly did you rule this out? Just because a checkpoint is over

doesn't mean that it hasn't had a negative effect on the OS cache. If
you're stuck going to disk, that could be hurting you (that being said, you
do point to a low I/O wait above, so you're probably correct in ruling this
out).

I'm currently digging through some statistics I've been gathering to see if
traffic increased at all, or remained the same when the slowdown occurred.
I'm also digging through the logs from the postgresql cluster (I increased
verbosity yesterday), looking for any clues. Any suggestions or clues on
where to look for this to see what can be causing a slowdown like this would
be greatly appreciated.

Are you capturing table-level stats from pg_stat_user_[tables | indexes]?

Just because a server doesn't look busy doesn't mean that you're not doing
1000 index scans per second returning 1000 tuples each time.

--Scott

Show quoted text

Thanks,
- Brian F

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/**mailpref/pgsql-general<http://www.postgresql.org/mailpref/pgsql-general>

Brian Fehrle

brianf@consistentstate.com

over 14 years ago

In reply to: Scott Mead (#8)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On 10/27/2011 02:27 PM, Scott Mead wrote:

On Thu, Oct 27, 2011 at 2:39 PM, Brian Fehrle
<brianf@consistentstate.com <mailto:brianf@consistentstate.com>> wrote:

Hi all, need some help/clues on tracking down a performance issue.

PostgreSQL version: 8.3.11

I've got a system that has 32 cores and 128 gigs of ram. We have
connection pooling set up, with about 100 - 200 persistent
connections open to the database. Our applications then use these
connections to query the database constantly, but when a
connection isn't currently executing a query, it's <IDLE>. On
average, at any given time, there are 3 - 6 connections that are
actually executing a query, while the rest are <IDLE>.

Remember, when you read pg_stat_activity, it is showing you query
activity from that exact specific moment in time. Just because it
looks like only 3-6 connections are executing, doesn't mean that 200
aren't actually executing < .1ms statements. With such a beefy box, I
would see if you can examine any stats from your connection pooler to
find out how many connections are actually getting used.

Correct, we're getting a few hundred transactions per second, but under
normal operation, polling pg_stat_activity will show the average of 3 -
6 queries that were running at that moment, and those queries run for an
average of 5 - 7 seconds. So my belief is that something happens to the
system where either a) We get a ton more queries than normal from the
application (currently hunting down data to support this), or b) the
overall speed of the system slows down so that all queries increase in
time so much that polling pg_stat_activity lets me actually see them.

About once a day, queries that normally take just a few seconds
slow way down, and start to pile up, to the point where instead of
just having 3-6 queries running at any given time, we get 100 -
200. The whole system comes to a crawl, and looking at top, the
CPU usage is 99%.

Looking at top, I see no SWAP usage, very little IOWait, and there
are a large number of postmaster processes at 100% cpu usage
(makes sense, at this point there are 150 or so queries currently
executing on the database).

Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi,
0.3%si, 0.2%st
Mem: 134217728k total, 131229972k used, 2987756k free, 462444k
buffers
Swap: 8388600k total, 296k used, 8388304k free, 119029580k
cached

In the past, we noticed that autovacuum was hitting some large
tables at the same time this happened, so we turned autovacuum off
to see if that was the issue, and it still happened without any
vacuums running.

That was my next question :)

We also ruled out checkpoints being the cause.

How exactly did you rule this out? Just because a checkpoint is over
doesn't mean that it hasn't had a negative effect on the OS cache. If
you're stuck going to disk, that could be hurting you (that being
said, you do point to a low I/O wait above, so you're probably correct
in ruling this out).

Checkpoint settings were set to the default per install. 5 minute
timeout, 0.5 completion target, and 30s warning. Looking at the logs, we
were getting a checkpoint every 5 minutes on the dot.

I looked at the data in pg_stat_database and noticed that buffers
written by checkpoints are near 4X that of the background writer. So I
implemented some changes to get more to be written by the background
writer, including increasing the checkpoint timeout to 30 minutes, and
setting the frequency of the bgwriter wait time from 200ms to 50ms.

checkpoints now happen 30 mins apart on the dot, and there was not a
checkpoint happening the last time this issue of major slowdown occured.

I'm currently digging through some statistics I've been gathering
to see if traffic increased at all, or remained the same when the
slowdown occurred. I'm also digging through the logs from the
postgresql cluster (I increased verbosity yesterday), looking for
any clues. Any suggestions or clues on where to look for this to
see what can be causing a slowdown like this would be greatly
appreciated.

Are you capturing table-level stats from pg_stat_user_[tables |
indexes]? Just because a server doesn't look busy doesn't mean that
you're not doing 1000 index scans per second returning 1000 tuples
each time.

I am not grabbing any of those at the moment, I'll look into those.

- Brian F

Show quoted text

--Scott

Thanks,
- Brian F

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org
<mailto:pgsql-general@postgresql.org>)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Brian Fehrle (#1)

Re: Server hitting 100% CPU usage, system comes to a crawl.

Brian Fehrle <brianf@consistentstate.com> writes:

Hi all, need some help/clues on tracking down a performance issue.
PostgreSQL version: 8.3.11

I've got a system that has 32 cores and 128 gigs of ram. We have
connection pooling set up, with about 100 - 200 persistent connections
open to the database. Our applications then use these connections to
query the database constantly, but when a connection isn't currently
executing a query, it's <IDLE>. On average, at any given time, there are
3 - 6 connections that are actually executing a query, while the rest
are <IDLE>.

About once a day, queries that normally take just a few seconds slow way
down, and start to pile up, to the point where instead of just having
3-6 queries running at any given time, we get 100 - 200. The whole
system comes to a crawl, and looking at top, the CPU usage is 99%.

This is jumping to a conclusion based on insufficient data, but what you
describe sounds a bit like the sinval queue contention problems that we
fixed in 8.4. Some prior reports of that:
http://archives.postgresql.org/pgsql-performance/2008-01/msg00001.php
http://archives.postgresql.org/pgsql-performance/2010-06/msg00452.php

If your symptoms match those, the best fix would be to update to 8.4.x
or later, but a stopgap solution would be to cut down on the number of
idle backends.

regards, tom lane

#11

Brian Fehrle

brianf@consistentstate.com

over 14 years ago

In reply to: Tom Lane (#10)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On 10/27/2011 02:50 PM, Tom Lane wrote:

Brian Fehrle<brianf@consistentstate.com> writes:

Hi all, need some help/clues on tracking down a performance issue.
PostgreSQL version: 8.3.11
I've got a system that has 32 cores and 128 gigs of ram. We have
connection pooling set up, with about 100 - 200 persistent connections
open to the database. Our applications then use these connections to
query the database constantly, but when a connection isn't currently
executing a query, it's<IDLE>. On average, at any given time, there are
3 - 6 connections that are actually executing a query, while the rest
are<IDLE>.
About once a day, queries that normally take just a few seconds slow way
down, and start to pile up, to the point where instead of just having
3-6 queries running at any given time, we get 100 - 200. The whole
system comes to a crawl, and looking at top, the CPU usage is 99%.

This is jumping to a conclusion based on insufficient data, but what you
describe sounds a bit like the sinval queue contention problems that we
fixed in 8.4. Some prior reports of that:
http://archives.postgresql.org/pgsql-performance/2008-01/msg00001.php
http://archives.postgresql.org/pgsql-performance/2010-06/msg00452.php

If your symptoms match those, the best fix would be to update to 8.4.x
or later, but a stopgap solution would be to cut down on the number of
idle backends.

regards, tom lane

That sounds somewhat close to the same issue I am seeing. Main
differences being that my spike lasts for much longer than a few
minutes, and can only be resolved when the cluster is restarted. Also,
that second link shows TOP where much of the CPU is via the 'user',
rather than the 'sys' like mine.

Is there anything I can look at more to get more info on this 'sinval
que contention problem'?

Also, having my cpu usage high in 'sys' rather than 'us', could that be
a red flag? Or is that normal?

- Brian F

#12

Alan Hodgson

ahodgson@simkin.ca

over 14 years ago

In reply to: Brian Fehrle (#6)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On October 27, 2011 01:09:51 PM Brian Fehrle wrote:

We've restarted the postgresql cluster, so the issue is not happening at
this moment. but running a vmstat 10 had my 'cs' average at 3K and 'in'
averaging around 9.5K.

Random thought, is there any chance the server is physically overheating? I've
seen CPUs throttle really low when overheating, which can make otherwise
normal activity seem really slow.

#13

Radosław Smogura

rsmogura@softperience.eu

over 14 years ago

In reply to: John R Pierce (#2)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On Thu, 27 Oct 2011 12:43:00 -0700, John R Pierce wrote:

On 10/27/11 11:39 AM, Brian Fehrle wrote:

I've got a system that has 32 cores and 128 gigs of ram. We have
connection pooling set up, with about 100 - 200 persistent connections
open to the database. Our applications then use these connections to
query the database constantly, but when a connection isn't currently
executing a query, it's <IDLE>. On average, at any given time, there
are 3 - 6 connections that are actually executing a query, while the
rest are <IDLE>.

thats not a very effective use of pooling. the pooling model, you'd
have a connection pool sufficient actual database connections to
satisfy your concurrency requirements, and your apps would grab a
connection from the pool, do a transaction, then release the
connection back to the pool.

now, I don't know that this has anything to do with your performance
problem, I'm just pointing out this anomaly. a pool doesn't do much
good if the clients grab a connection and just sit on it.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

It is good model, he have 3-6 connection at one time, so it's look
quite clear that icrease of concurrent connections is caused by
unexpected background processing.

#14

David Kerr

dmk@mr-paradox.net

over 14 years ago

In reply to: Brian Fehrle (#6)

Re: Server hitting 100% CPU usage, system comes to a crawl.

On Thu, Oct 27, 2011 at 02:09:51PM -0600, Brian Fehrle wrote:
- On 10/27/2011 01:48 PM, Scott Marlowe wrote:
- >On Thu, Oct 27, 2011 at 12:39 PM, Brian Fehrle
- ><brianf@consistentstate.com> wrote:
- >>Looking at top, I see no SWAP usage, very little IOWait, and there are a
- >>large number of postmaster processes at 100% cpu usage (makes sense, at
- >>this
- >>point there are 150 or so queries currently executing on the database).
- >>
- >> Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
- >>Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si,
- >> 0.2%st
- >>Mem: 134217728k total, 131229972k used, 2987756k free, 462444k buffers
- >>Swap: 8388600k total, 296k used, 8388304k free, 119029580k cached
- >OK, a few points. 1: You've got a zombie process. Find out what's
- >causing that, it could be a trigger of some type for this behaviour.
- >2: You're 92% sys. That's bad. It means the OS is chewing up 92% of
- >your 32 cores doing something. what tasks are at the top of the list
- >in top?
- >
- Out of the top 50 processes in top, 48 of them are postmasters, one is
- syslog, and one is psql. Each of the postmasters have a high %CPU, the
- top ones being 80% and higher, the rest being anywhere between 30% -
- 60%. Would postmaster 'queries' that are running attribute to the sys
- CPU usage, or should they be under the 'us' CPU usage?

total spitball here but - I had something similar happen once and it
was syslog causing the problem.

Are you using regular vanilla syslog? or syslog-ng/rsyslog? my problem
was vanilla syslog. When I moved to -ng/rsyslog or logging to a file
my problem went away.

Dave

#15

Brian Fehrle

brianf@consistentstate.com

over 14 years ago

In reply to: Brian Fehrle (#11)

Re: Server hitting 100% CPU usage, system comes to a crawl.

Update on this:

We did a switchover to another machine with the same hardware, however
this system was running on some older parameters we had set in the
postgresql.conf file.

So we went from 400 max_connections to 200 max_connections, and 160MB
work_mem to 200MB work_mem. And now on this other system, so far it
seems to be running ok.

Other than the obvious fact that each connection has a certain amount of
memory usage, is there something else to watch for when increasing
connections to numbers like 400? When we had the issue of the system
jumping to 100% cpu usage, even at that point our number of backends to
the cluster was at MAX 250, but generally in the 175 range, so well
below our 400 max_connections we allow. So could this be the culprit?

I'll be watching the cluster as we run on the new configuration (with
only 200 max_connections).

- Brian F

Show quoted text

On 10/27/2011 03:22 PM, Brian Fehrle wrote:

On 10/27/2011 02:50 PM, Tom Lane wrote:

Brian Fehrle<brianf@consistentstate.com> writes:

Hi all, need some help/clues on tracking down a performance issue.
PostgreSQL version: 8.3.11
I've got a system that has 32 cores and 128 gigs of ram. We have
connection pooling set up, with about 100 - 200 persistent connections
open to the database. Our applications then use these connections to
query the database constantly, but when a connection isn't currently
executing a query, it's<IDLE>. On average, at any given time, there are
3 - 6 connections that are actually executing a query, while the rest
are<IDLE>.
About once a day, queries that normally take just a few seconds slow
way
down, and start to pile up, to the point where instead of just having
3-6 queries running at any given time, we get 100 - 200. The whole
system comes to a crawl, and looking at top, the CPU usage is 99%.

This is jumping to a conclusion based on insufficient data, but what you
describe sounds a bit like the sinval queue contention problems that we
fixed in 8.4. Some prior reports of that:
http://archives.postgresql.org/pgsql-performance/2008-01/msg00001.php
http://archives.postgresql.org/pgsql-performance/2010-06/msg00452.php

If your symptoms match those, the best fix would be to update to 8.4.x
or later, but a stopgap solution would be to cut down on the number of
idle backends.

regards, tom lane

That sounds somewhat close to the same issue I am seeing. Main
differences being that my spike lasts for much longer than a few
minutes, and can only be resolved when the cluster is restarted. Also,
that second link shows TOP where much of the CPU is via the 'user',
rather than the 'sys' like mine.

Is there anything I can look at more to get more info on this 'sinval
que contention problem'?

Also, having my cpu usage high in 'sys' rather than 'us', could that
be a red flag? Or is that normal?

- Brian F

#16

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Brian Fehrle (#15)

Re: Server hitting 100% CPU usage, system comes to a crawl.

Brian Fehrle <brianf@consistentstate.com> writes:

Update on this:
We did a switchover to another machine with the same hardware, however
this system was running on some older parameters we had set in the
postgresql.conf file.

So we went from 400 max_connections to 200 max_connections, and 160MB
work_mem to 200MB work_mem. And now on this other system, so far it
seems to be running ok.

Other than the obvious fact that each connection has a certain amount of
memory usage, is there something else to watch for when increasing
connections to numbers like 400? When we had the issue of the system
jumping to 100% cpu usage, even at that point our number of backends to
the cluster was at MAX 250, but generally in the 175 range, so well
below our 400 max_connections we allow. So could this be the culprit?

Well, yeah, the pre-8.4 sinval problems I mentioned scale with the
number of live backends. When you have many more backends in the system
that will contribute to the problem, even --- in fact, especially --- if
the extra ones are idle.

regards, tom lane