High SYS CPU - need advise

Started by Vlad over 13 years ago · 72 messages · general
#1Vlad
marchenko@gmail.com

Hello everyone,

I'm seeking help in diagnosing / figuring out the issue that we have with
our DB server:

Under some (relatively non-heavy) load: 300-400 TPS, every 10-30 seconds the
server drops into high system CPU usage (90%+ SYSTEM across all CPUs - it's
pure SYS cpu, i.e. it's not io wait, not irq, not user). Postgresql is
taking 10-15% at the same time. Those periods last from a few seconds
to minutes, or until Postgresql is restarted. Needless to say, the system is
barely responsive, with load average hitting over 100. We have mostly
select statements (joins across a few tables), using indexes and returning
a small number of records. Should the number of incoming requests per second
drop a bit, the server does not fall into those HIGH-SYS-CPU periods. It
all seems like postgres runs out of some resource or fights for some
locks, and that causes the kernel to go into la-la land trying to manage it.

So far we've checked:
- disk and nic delays / errors / utilization
- WAL files (created rarely)
- tables are vacuumed OK; periods of high SYS are not tied to the vacuum process
- kernel resources utilization (sufficient FS handles, shared MEM/SEM, VM)
- increased log level, but nothing suspicious/different (to me) is reported
there during periods of high sys-cpu
- ran pgbench (could not reproduce the issue, even though it was producing
over 40,000 TPS for a prolonged period of time)

Basically, our symptoms are exactly as reported here over a year ago
(though that was for postgres 8.3; we run 9.1):
http://archives.postgresql.org/pgsql-general/2011-10/msg00998.php

I will be grateful for any ideas helping to resolve or diagnose this
problem.

Environment background:

Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.

Linux 3.5.5 (Fedora 17 x64) on 32Gb RAM / 8 cores

Default configuration changed:
max_connections = 1200
shared_buffers = 3200MB
temp_buffers = 18MB
max_prepared_transactions = 500
work_mem = 16MB
maintenance_work_mem = 64MB
max_files_per_process = 3000
wal_level = hot_standby
fsync = off
checkpoint_segments = 64
checkpoint_timeout = 15min
effective_cache_size = 8GB
default_statistics_target = 500

-- Vlad

#2John R Pierce
pierce@hogranch.com
In reply to: Vlad (#1)
Re: High SYS CPU - need advise

On 11/14/12 1:13 PM, Vlad wrote:

Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.

that's a really high client connection count for an 8 core system.

I'd consider implementing a connection pool (like pgbouncer), and
rewriting your client applications to connect, do a transaction,
disconnect, so the actual number of postgres connections is much lower,
say in the 16-48 range.
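As a sketch of what this might look like in practice, here is a minimal
pgbouncer configuration in transaction-pooling mode. The database name, paths,
and sizes are illustrative assumptions, not taken from Vlad's setup:

```ini
; hypothetical pgbouncer.ini sketch -- database name, paths and pool
; sizes are made up for illustration
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; a server connection is held only for the duration of a transaction
pool_mode = transaction
; many application clients multiplexed onto few backend connections
max_client_conn = 500
default_pool_size = 32
```

Clients would then connect to port 6432 instead of 5432, and the number of
actual postgres backends stays near default_pool_size regardless of how many
application clients are connected.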

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#3Vlad
marchenko@gmail.com
In reply to: John R Pierce (#2)
Re: High SYS CPU - need advise

John,

thanks for your feedback. While implementing connection pooling would make
resource utilization more efficient, I don't think it's the root of my
problem. Most of the connected clients are at IDLE. When I do

select * from pg_stat_activity where current_query not like '%IDLE%';

I only see several active queries at any given time.

-- Vlad

On Wed, Nov 14, 2012 at 3:23 PM, John R Pierce <pierce@hogranch.com> wrote:


On 11/14/12 1:13 PM, Vlad wrote:

Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.

thats a really high client connection count for a 8 core system.

I'd consider implementing a connection pool (like pgbouncer), and
rewriting your client applications to connect, do a transaction,
disconnect, so the actual number of postgres connections is much lower, say
in the 16-48 range.

#4John R Pierce
pierce@hogranch.com
In reply to: Vlad (#3)
Re: High SYS CPU - need advise

On 11/14/12 1:34 PM, Vlad wrote:

thanks for your feedback. While implementing connection pooling would
make resources utilization more efficient, I don't think it's the root
of my problem. Most of the connected clients are at IDLE. When I do

select * from pg_stat_activity where current_query not like '%IDLE%';

I only see several active queries at any given time.

what about during these spikes?

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#5Merlin Moncure
mmoncure@gmail.com
In reply to: John R Pierce (#4)
Re: High SYS CPU - need advise

On Wed, Nov 14, 2012 at 4:08 PM, John R Pierce <pierce@hogranch.com> wrote:

On 11/14/12 1:34 PM, Vlad wrote:

thanks for your feedback. While implementing connection pooling would make
resources utilization more efficient, I don't think it's the root of my
problem. Most of the connected clients are at IDLE. When I do

select * from pg_stat_activity where current_query not like '%IDLE%';

I only see several active queries at any given time.

what about during these spikes?

Yeah -- if you are seeing a lot of queries pile up during these times,
it's time to explore a connection pooler, because it will keep system
load manageable in these situations.

things to check:
*) blocked queries (pg_locks/pg_stat_activity)
*) i/o wait. in particular, is linux page cache flushing.
*) query stampede: we will want to measure TPS leading into the stall
and out of it.
*) anything else running on the box?
*) you have a large number of tables? maybe this is statistics-file related?
*) let's log checkpoints to see if there is correlation with stall
*) nice to have vmstat/iostat during/before/after stall. for example,
massive spike of context switches during stall could point to o/s
scheduler issue.

merlin

#6Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#5)
Re: High SYS CPU - need advise

there is no big spike of queries that causes this; queries come at a
relatively stable pace. It's just that the higher the rate of incoming
queries, the more likely this is to happen. yes, when a stall happens, the
active queries pile up - but that's the result of the stall (the server
reacts slowly to a keypress, not to mention query execution), not the cause.

things to check:

*) blocked queries (pg_locks/pg_stat_activity)

nada

*) i/o wait. in particular, is linux page cache flushing.

no i/o wait, no IRQ (see below)

*) query stampede: we will want to measure TPS leading into the stall
and out of it.
*) anything else running on the box?

just bare linux + postgresql.

*) you have a large amount of table? maybe this statistics file related?

over 1000 tables across 4 or 5 schemas in a single database.

*) let's log checkpoints to see if there is correlation with stall

checked, checkpoints happen much more rarely and without relation to the
high-sys periods

*) nice to have vmstat/iostat during/before/after stall. for example,
massive spike of context switches during stall could point to o/s
scheduler issue.

checked that as well - nothing. CS even lower.

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          16.94   0.00     9.28     0.38    0.00  73.40

Device:   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda      6.00      48.00       0.00       48        0

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          18.06   0.00    18.43     0.25    0.00  63.26

Device:   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda     10.00     104.00       0.00      104        0

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           9.12   0.00    86.74     0.12    0.00   4.01

Device:   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda      0.97       7.77       0.00        8        0

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.44   0.00    96.58     0.00    0.00   1.98

Device:   tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda      3.28      78.69       0.00      144        0

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b swpd   free   buff    cache  si  so  bi  bo     in   cs us sy id wa st
 1  0    0 279240  12016 14431964   0   0  32   0 197852 4299 15  9 76  0  0
 4  0    0 225984  12024 14419696   0   0   0  64 197711 5158 11  9 79  1  0
 0  0    0 260112  12024 14413636   0   0  48   0 196708 4618 17 10 73  0  0
 6  0    0 233936  12024 14375784   0   0 104   0 179861 4884 19 17 64  0  0
30  0    0 224904  12024 14354812   0   0   8   0  51088 1205  9 86  5  0  0
72  0    0 239144  12024 14333852   0   0 144   0  45601  542  2 98  0  0  0
78  0    0 224840  12024 14328536   0   0   0   0  38732  481  2 94  5  0  0
22  1    0 219072  12032 14250652   0   0 136 100  47323 1231  9 90  1  0  0

#7Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#6)
Re: High SYS CPU - need advise

On Thu, Nov 15, 2012 at 11:50 AM, Vlad <marchenko@gmail.com> wrote:

there is no big spike of queries that cause that, queries come in relatively
stable pace. It's just when the higher rate of queries coming, the more
likely this to happen. yes, when stall happens , the active queries pile up
- but that's the result of a stall (the server reacts slow on a keypress,
not to mention queries execution), not the cause.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b swpd   free   buff    cache  si  so  bi  bo     in   cs us sy id wa st
 1  0    0 279240  12016 14431964   0   0  32   0 197852 4299 15  9 76  0  0
 4  0    0 225984  12024 14419696   0   0   0  64 197711 5158 11  9 79  1  0
 0  0    0 260112  12024 14413636   0   0  48   0 196708 4618 17 10 73  0  0
 6  0    0 233936  12024 14375784   0   0 104   0 179861 4884 19 17 64  0  0
30  0    0 224904  12024 14354812   0   0   8   0  51088 1205  9 86  5  0  0
72  0    0 239144  12024 14333852   0   0 144   0  45601  542  2 98  0  0  0
78  0    0 224840  12024 14328536   0   0   0   0  38732  481  2 94  5  0  0
22  1    0 219072  12032 14250652   0   0 136 100  47323 1231  9 90  1  0  0

hm. well, we can definitely rule out i/o. I reviewed your last
posting, and you said:
"Out of the top 50 processes in top, 48 of them are postmasters, one
is syslog, and one is psql. Each of the postmasters have a high %CPU,
the top ones being 80% and higher, the rest being anywhere between 30%
- 60%. Would postmaster 'queries' that are running attribute to the
sys CPU usage, or should they be under the 'us' CPU usage?"

Is this still true? Can we capture strace output from one of the high-%
postmasters to see if there are any clues there? Maybe we've uncovered
some type of weird spinlock contention issue. How large is your
database (or at least the portion of it that's commonly used)? Would
you characterize your queries as mostly small lookups, scans, or a
mix?

merlin

#8Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#7)
Re: High SYS CPU - need advise

Merlin,

this is not my report - it's probably from the thread that I referenced as
having common symptoms. Here is info about my db:

Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.
We have mostly select statements (joins across few tables), using indexes
and resulting in a small number of records returned.

So mostly small lookups across a few tables, joined by a primary key / index.
Also, I just performed an experiment - I switched our app over to using a hot
PITR standby instead of the master, and it experienced the same problem. Since
a PITR db can only perform read-only queries, there are no write locks (except
maybe when the standby is replaying WAL records from the master?); nevertheless,
SYS CPU jumped sky-high.

-- Vlad


Is this still true? Can we capture strace from one of the high %
postmasters to see if there's any clues there. Maybe we've uncovered
some type of weird spinlock contention issue. How large is your
database (or at least the portion of it that's commonly used)? Would
you characterize your queries as mostly small lookups, scans, or a
mix?

merlin

#9Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#8)
Re: High SYS CPU - need advise

On Thu, Nov 15, 2012 at 2:20 PM, Vlad <marchenko@gmail.com> wrote:

Merlin,

this is not my report, probably from a thread that I've referenced as having
a common symptoms. Here is info about my db:

Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.
We have mostly select statements (joins across few tables), using indexes
and resulting in a small number of records returned.

So mostly small lookups across few tables joining by a primary key / index.
Also, I just perform an experiment - I switched our app over to using hot
PITR instead of master and it experienced the same problem. So since PITR db
can only perform read-only queries, there is no write-locks (except maybe
when pitr is playing wal records from the master?), nevertheless SYS CPU
jumped sky.

yeah. ok, next steps:
*) can you confirm that postgres process is using high cpu (according
to top) during stall time
*) if, so, please strace that process and save some of the log
*) you're using a 'bleeding edge' kernel. so we must be suspicious of
a regression there, particularly in the scheduler.
*) I am suspicious of a spinlock issue. so, if we can't isolate the
problem, is running a hand-compiled postgres a possibility (for lock
stats)?
*) what is the output of this:
echo /proc/sys/vm/zone_reclaim_mode

merlin

#10Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#9)
Re: High SYS CPU - need advise

yeah. ok, next steps:
*) can you confirm that postgres process is using high cpu (according
to top) during stall time

yes, CPU is spread across a lot of postmasters

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29863 pgsql 20 0 3636m 102m 36m R 19.1 0.3 0:01.33 postmaster
30277 pgsql 20 0 3645m 111m 37m R 16.8 0.3 0:01.27 postmaster
11966 pgsql 20 0 3568m 22m 15m R 15.1 0.1 0:00.66 postmaster
8073 pgsql 20 0 3602m 60m 26m S 13.6 0.2 0:00.77 postmaster
29780 pgsql 20 0 3646m 115m 43m R 13.6 0.4 0:01.13 postmaster
11865 pgsql 20 0 3606m 61m 23m S 12.8 0.2 0:01.87 postmaster
29379 pgsql 20 0 3603m 70m 30m R 12.8 0.2 0:00.80 postmaster
29727 pgsql 20 0 3616m 77m 31m R 12.5 0.2 0:00.81 postmaster

*) if, so, please strace that process and save some of the log

https://dl.dropbox.com/u/109778/stall_postmaster.log

*) you're using a 'bleeding edge' kernel. so we must be suspicious of
a regression there, particularly in the scheduler.

this was observed for a while, during which period the system went from
using 3.4.* kernels to 3.5.*... but I do not deny such a possibility.

*) I am suspicious of spinlock issue. so, if we can't isolate the
problem, is running a hand complied postgres a possibility (for lock
stats)?

Yes, definitely possible - we run manually compiled postgresql anyway.
Please provide instructions.

*) what is the output of this:
echo /proc/sys/vm/zone_reclaim_mode

I presume you wanted cat instead of echo, and it shows 0.

-- vlad

#11Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#10)
Re: High SYS CPU - need advise

On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko@gmail.com> wrote:

yeah. ok, next steps:
*) can you confirm that postgres process is using high cpu (according
to top) during stall time

yes, CPU is spread across a lot of postmasters

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29863 pgsql 20 0 3636m 102m 36m R 19.1 0.3 0:01.33 postmaster
30277 pgsql 20 0 3645m 111m 37m R 16.8 0.3 0:01.27 postmaster
11966 pgsql 20 0 3568m 22m 15m R 15.1 0.1 0:00.66 postmaster
8073 pgsql 20 0 3602m 60m 26m S 13.6 0.2 0:00.77 postmaster
29780 pgsql 20 0 3646m 115m 43m R 13.6 0.4 0:01.13 postmaster
11865 pgsql 20 0 3606m 61m 23m S 12.8 0.2 0:01.87 postmaster
29379 pgsql 20 0 3603m 70m 30m R 12.8 0.2 0:00.80 postmaster
29727 pgsql 20 0 3616m 77m 31m R 12.5 0.2 0:00.81 postmaster

*) if, so, please strace that process and save some of the log

https://dl.dropbox.com/u/109778/stall_postmaster.log

ok, excellent. reviewing the log, this immediately caught my eye:

recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
64, 0, NULL, NULL) = 64
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
semop(41713721, {{2, 1, 0}}, 1) = 0
lseek(295, 0, SEEK_END) = 0
lseek(296, 0, SEEK_END) = 8192

this is definitely pointing to a spinlock issue; see s_lock.c. What you
are seeing here is a backend getting caught in a spin loop via a stuck
spinlock (0-9 = 10 delays). did you by any chance check the logs? did
any backends restart? we are looking for this:

    elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
         lock, file, line);

Anything else going on in the log? The way to enable lock stats is
via the LWLOCK_STATS macro. But before doing anything, please check the
logs for errors/panics.

merlin

#12Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#11)
Re: High SYS CPU - need advise

On Thu, Nov 15, 2012 at 3:49 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko@gmail.com> wrote:

yeah. ok, next steps:
*) can you confirm that postgres process is using high cpu (according
to top) during stall time

yes, CPU is spread across a lot of postmasters

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29863 pgsql 20 0 3636m 102m 36m R 19.1 0.3 0:01.33 postmaster
30277 pgsql 20 0 3645m 111m 37m R 16.8 0.3 0:01.27 postmaster
11966 pgsql 20 0 3568m 22m 15m R 15.1 0.1 0:00.66 postmaster
8073 pgsql 20 0 3602m 60m 26m S 13.6 0.2 0:00.77 postmaster
29780 pgsql 20 0 3646m 115m 43m R 13.6 0.4 0:01.13 postmaster
11865 pgsql 20 0 3606m 61m 23m S 12.8 0.2 0:01.87 postmaster
29379 pgsql 20 0 3603m 70m 30m R 12.8 0.2 0:00.80 postmaster
29727 pgsql 20 0 3616m 77m 31m R 12.5 0.2 0:00.81 postmaster

*) if, so, please strace that process and save some of the log

https://dl.dropbox.com/u/109778/stall_postmaster.log

ok, excellent. reviewing the log, this immediately caught my eye:

recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
64, 0, NULL, NULL) = 64
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
semop(41713721, {{2, 1, 0}}, 1) = 0
lseek(295, 0, SEEK_END) = 0
lseek(296, 0, SEEK_END) = 8192

this is definitely pointing to spinlock issue. see: slock.c what you
are seeing here is a backend getting caught in a spin loop via a stuck
spinlock. 0-9 = 10 times. did you by any chance check the logs? did
any backends restart? we are looking for this:

elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
lock, file, line);

Anything else going on in the log? The way to enable locks status is
via LWLOCK_STATS macro. But before doing any thing plz check logs for
error/panic.

ah, scratch that. reading the code again, s_lock.c allows for up
to 1 second (that is, you can delay 1000 times), so the panic is
unlikely to be in the log. also, spinlock waits shouldn't increase by
exactly 1000 us. investigating...

merlin

#13Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Merlin Moncure (#11)
Re: High SYS CPU - need advise

Merlin Moncure wrote:

ok, excellent. reviewing the log, this immediately caught my eye:

recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
64, 0, NULL, NULL) = 64
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
semop(41713721, {{2, 1, 0}}, 1) = 0
lseek(295, 0, SEEK_END) = 0
lseek(296, 0, SEEK_END) = 8192

this is definitely pointing to spinlock issue.

I met Rik van Riel (Linux kernel hacker) recently and we chatted about
this briefly. He strongly suggested that we should consider using
futexes on Linux instead of spinlocks; the big advantage being that
futexes sleep instead of spinning when contention is high. That would
reduce the system load in this scenario.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#11)
Re: High SYS CPU - need advise

sorry - no panics / errors in the log...

-- Vlad

#15Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#13)
Re: High SYS CPU - need advise

On Thu, Nov 15, 2012 at 4:29 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Merlin Moncure wrote:

ok, excellent. reviewing the log, this immediately caught my eye:

recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
64, 0, NULL, NULL) = 64
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
semop(41713721, {{2, 1, 0}}, 1) = 0
lseek(295, 0, SEEK_END) = 0
lseek(296, 0, SEEK_END) = 8192

this is definitely pointing to spinlock issue.

I met Rik van Riel (Linux kernel hacker) recently and we chatted about
this briefly. He strongly suggested that we should consider using
futexes on Linux instead of spinlocks; the big advantage being that
futexes sleep instead of spinning when contention is high. That would
reduce the system load in this scenario.

Well, so do postgres spinlocks, right? When we overflow
spins_per_delay we go to pg_usleep, which proxies to select() --
postgres spinlocks are a hybrid implementation. Moving to futexes is a
possible improvement (that's another discussion) in degenerate cases,
but I'm not sure that I've exactly zeroed in on the problem. Or am I
missing something?

What I've been scratching my head over is what code exactly would
cause an iterative sleep like the above. The code is here:

    pg_usleep(cur_delay * 1000L);

    /* increase delay by a random fraction between 1X and 2X */
    cur_delay += (int) (cur_delay *
                        ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);
    /* wrap back to minimum delay when max is exceeded */
    if (cur_delay > MAX_DELAY_MSEC)
        cur_delay = MIN_DELAY_MSEC;

...so cur_delay is supposed to increase in a non-linear fashion. I've
looked around the sleep, usleep, and latch calls and as of yet haven't
found anything that advances the timeout just like that (need to do
another pass). And we don't know for sure whether this is directly related
to the OP's problem.

merlin

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#15)
Re: High SYS CPU - need advise

Merlin Moncure <mmoncure@gmail.com> writes:

What I've been scratching my head over is what code exactly would
cause an iterative sleep like the above. The code is here:

    pg_usleep(cur_delay * 1000L);

    /* increase delay by a random fraction between 1X and 2X */
    cur_delay += (int) (cur_delay *
                        ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);
    /* wrap back to minimum delay when max is exceeded */
    if (cur_delay > MAX_DELAY_MSEC)
        cur_delay = MIN_DELAY_MSEC;

IIRC that coding is of relatively recent vintage. The OP is probably
running some older version that increased cur_delay differently.

regards, tom lane

#17Vlad
marchenko@gmail.com
In reply to: Tom Lane (#16)
Re: High SYS CPU - need advise

Tom,

I just checked the version I'm running (9.1.6), and the code is quite
similar (src/backend/storage/lmgr/s_lock.c)

    pg_usleep(cur_delay * 1000L);

#if defined(S_LOCK_TEST)
    fprintf(stdout, "*");
    fflush(stdout);
#endif

    /* increase delay by a random fraction between 1X and 2X */
    cur_delay += (int) (cur_delay *
                        ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);
    /* wrap back to minimum delay when max is exceeded */
    if (cur_delay > MAX_DELAY_MSEC)
        cur_delay = MIN_DELAY_MSEC;

-- vlad

#18Jeff Janes
jeff.janes@gmail.com
In reply to: Merlin Moncure (#15)
Re: High SYS CPU - need advise

On Thu, Nov 15, 2012 at 2:44 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)

This is not entirely inconsistent with the spinlock. Note that 1000
is repeated 3 times, and 5000 is missing.

This might just be a misleading random sample we got here. I've seen
similar close spacing in some simulations I've run.

It is not clear to me why we use a resolution of 1 msec here. If the
OS's implementation of select() eventually rounds to the nearest msec,
that is its business. But why do we have to lose intermediate
precision due to its decision?

Cheers,

Jeff

#19Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#18)
Re: High SYS CPU - need advise

On Thu, Nov 15, 2012 at 6:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Nov 15, 2012 at 2:44 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)

This is not entirely inconsistent with the spinlock. Note that 1000
is repeated 3 times, and 5000 is missing.

This might just be a misleading random sample we got here. I've seen
similar close spacing in some simulations I've run.

It is not clear to me why we use a resolution of 1 msec here. If the
OS's implementation of select() eventually rounds to the nearest msec,
that is its business. But why do we have to lose intermediate
precision due to its decision?

Yeah -- you're right, this is definitely a spinlock issue. Next steps:

*) in mostly read workloads, we have a couple of known frequent
offenders. In particular the 'BufFreelistLock'. One way we can
influence that guy is to try and significantly lower/raise shared
buffers. So this is one thing to try.

*) failing that, the LWLOCK_STATS macro can be compiled in to give us some
information about the particular lock(s) we're blocking on. Hopefully
it's a lwlock -- this will make diagnosing the problem easier.

*) if we're not blocking on a lwlock, it's possibly a buffer pin related
issue? I've seen this before, for example on an index scan that is
dependent on a seq scan. This long thread:
http://postgresql.1045698.n5.nabble.com/9-2beta1-parallel-queries-ReleasePredicateLocks-CheckForSerializableConflictIn-in-the-oprofile-td5709812i100.html
has a lot of information about that case and deserves a review.

*) we can consider experimenting with futex
(http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php)
to see if things improve. This is dangerous, and could crash your
server/eat your data, so fair warning.

merlin

#20Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#19)
Re: High SYS CPU - need advise

Merlin,

Yeah -- you're right, this is definitely spinlock issue. Next steps:

*) in mostly read workloads, we have a couple of known frequent
offenders. In particular the 'BufFreelistLock'. One way we can
influence that guy is to try and significantly lower/raise shared
buffers. So this is one thing to try.

server has 32gb, with 3.2gb dedicated to shared buffers. I've increased
it to 13.2gb; the stall still happened (I have a way of controlling the
number of queries hitting postgresql by shutting down our own memcache-based
app cache, so to test for a stall I temporarily shut down a few memcached
servers).

*) failing that, LWLOCK_STATS macro can be compiled in to give us some
information about the particular lock(s) we're binding on. Hopefully
it's a lwlock -- this will make diagnosing the problem easier.

I've enabled that macro and am seeing lwlock messages flying in the log (see
below), even when no high-sys-cpu stall is observed at the moment.
Should I be looking for something in particular?

PID 17293 lwlock 5906: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5932: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5934: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10854: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10856: shacq 4 exacq 0 blk 0
PID 17293 lwlock 10858: shacq 15 exacq 0 blk 0
PID 17293 lwlock 16442: shacq 4 exacq 0 blk 0
PID 17293 lwlock 16596: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20458: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20460: shacq 10 exacq 0 blk 0
PID 17293 lwlock 20464: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20466: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20480: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20482: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20484: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20508: shacq 12 exacq 0 blk 0
PID 17293 lwlock 20510: shacq 6 exacq 0 blk 0
PID 17293 lwlock 20938: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20940: shacq 22 exacq 0 blk 0
PID 17293 lwlock 20942: shacq 28 exacq 0 blk 0
PID 17293 lwlock 20944: shacq 56 exacq 0 blk 0
PID 17293 lwlock 20946: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20948: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20950: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20952: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20954: shacq 7 exacq 0 blk 0
PID 17293 lwlock 20956: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20958: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20960: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20962: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20964: shacq 2 exacq 0 blk 0
PID 17293 lwlock 46624: shacq 24 exacq 0 blk 0
PID 17293 lwlock 68126: shacq 6 exacq 0 blk 0

*) if we're not blocking on lwlock, it's possibly a buffer pin related
issue? I've seen this before, for example on an index scan that is
dependent on an seq scan. This long thread:
http://postgresql.1045698.n5.nabble.com/9-2beta1-parallel-queries-ReleasePredicateLocks-CheckForSerializableConflictIn-in-the-oprofile-td5709812i100.html
has a lot information about that case and deserves a review.

*) we can consider experimenting with futex
(http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php)
to see if things improve. This is dangerous, and could crash your
server/eat your data, so fair warning.

merlin

-- vlad

#21Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#20)
#22Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#21)
#23Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#22)
#24Jeff Janes
jeff.janes@gmail.com
In reply to: Merlin Moncure (#21)
#25Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#24)
#26Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#25)
#27Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#26)
#28Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#27)
#29Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#28)
#30Jeff Janes
jeff.janes@gmail.com
In reply to: Vlad (#26)
#31Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#30)
#32Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#31)
#33Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#32)
#34Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#33)
#35Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#34)
#36Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#31)
#37Shaun Thomas
sthomas@optionshouse.com
In reply to: Merlin Moncure (#27)
#38Merlin Moncure
mmoncure@gmail.com
In reply to: Shaun Thomas (#37)
#39Shaun Thomas
sthomas@optionshouse.com
In reply to: Merlin Moncure (#38)
#40Merlin Moncure
mmoncure@gmail.com
In reply to: Shaun Thomas (#39)
#41Shaun Thomas
sthomas@optionshouse.com
In reply to: Merlin Moncure (#40)
#42Jeff Janes
jeff.janes@gmail.com
In reply to: Merlin Moncure (#38)
#43Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#13)
#44Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#42)
#45Jeff Janes
jeff.janes@gmail.com
In reply to: Merlin Moncure (#44)
#46Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#45)
#47Jeff Janes
jeff.janes@gmail.com
In reply to: Merlin Moncure (#46)
#48Shaun Thomas
sthomas@optionshouse.com
In reply to: Jeff Janes (#47)
#49Jeff Janes
jeff.janes@gmail.com
In reply to: Shaun Thomas (#48)
#50Shaun Thomas
sthomas@optionshouse.com
In reply to: Jeff Janes (#49)
#51Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#47)
#52Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Shaun Thomas (#50)
#53Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#31)
#54Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#53)
#55Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#54)
#56Jeff Janes
jeff.janes@gmail.com
In reply to: Vlad (#55)
#57Vlad
marchenko@gmail.com
In reply to: Jeff Janes (#56)
#58Jeff Janes
jeff.janes@gmail.com
In reply to: Vlad (#53)
#59Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#58)
#60Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#59)
#61Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#60)
#62Jeff Janes
jeff.janes@gmail.com
In reply to: Vlad (#60)
#63Vlad
marchenko@gmail.com
In reply to: Jeff Janes (#62)
#64Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#63)
#65Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#64)
#66Merlin Moncure
mmoncure@gmail.com
In reply to: Vlad (#65)
#67Vlad
marchenko@gmail.com
In reply to: Merlin Moncure (#66)
#68Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Vlad (#67)
#69Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Kevin Grittner (#68)
#70Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Gavin Flower (#69)
#71Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Kevin Grittner (#70)
#72Vlad
marchenko@gmail.com
In reply to: Gavin Flower (#69)