High SYS CPU - need advice
Hello everyone,
I'm seeking help in diagnosing / figuring out the issue that we have with
our DB server:
Under a relatively non-heavy load (300-400 TPS), every 10-30 seconds the
server drops into high system CPU usage (90%+ SYSTEM across all CPUs - it's
pure SYS cpu, i.e. not iowait, not IRQ, not user). Postgresql is taking
10-15% at the same time. These periods last from a few seconds to minutes,
or until Postgresql is restarted. Needless to say, the system is barely
responsive, with the load average hitting over 100. We have mostly select
statements (joins across a few tables), using indexes and resulting in a
small number of records returned. Should the number of incoming requests
per second drop a bit, the server does not fall into those HIGH-SYS-CPU
periods. It all looks as if postgres runs out of some resource, or is
fighting for some locks, and that sends the kernel into la-la land trying
to manage it.
So far we've checked:
- disk and NIC delays / errors / utilization
- WAL files (created rarely)
- tables are vacuumed OK; the high-SYS periods are not tied to the vacuum process
- kernel resource utilization (sufficient FS handles, shared MEM/SEM, VM)
- increased the log level, but nothing suspicious/different (to me) is
reported there during the high-SYS-CPU periods
- ran pgbench (could not reproduce the issue, even though it sustained
over 40,000 TPS for a prolonged period)
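For reference, the pgbench attempt looked roughly like this (a sketch only - the scale factor, client count, and database name here are illustrative, not our exact settings):

```shell
# build a throwaway test database, then drive SELECT-only traffic
# at a high client count for 10 minutes to try to trigger the stall
pgbench -i -s 100 bench
pgbench -S -c 400 -j 8 -T 600 bench    # -S = built-in select-only script
```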
Basically, our symptoms are exactly as reported here over a year ago
(though that was for postgres 8.3; we run 9.1):
http://archives.postgresql.org/pgsql-general/2011-10/msg00998.php
I will be grateful for any ideas helping to resolve or diagnose this
problem.
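In case it helps to catch a spike in the act: the SYS share can be sampled without extra tools straight from /proc/stat (a minimal sketch; mpstat -P ALL from the sysstat package gives the same numbers per CPU):

```shell
# two samples of the aggregate "cpu" line, one second apart;
# fields after the label are: user nice system idle ...
read -r _ u1 n1 s1 i1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 _ < /proc/stat
total=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) + (i2 - i1) ))
[ "$total" -gt 0 ] || total=1   # guard against a zero-length interval
echo "sys share: $(( 100 * (s2 - s1) / total ))%"
```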
Environment background:
Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them idle.
The database has over 1000 tables (across 5 namespaces) and takes ~150GB on disk.
Linux 3.5.5 (Fedora 17 x64), 32GB RAM / 8 cores
Changes from the default configuration:
max_connections = 1200
shared_buffers = 3200MB
temp_buffers = 18MB
max_prepared_transactions = 500
work_mem = 16MB
maintenance_work_mem = 64MB
max_files_per_process = 3000
wal_level = hot_standby
fsync = off
checkpoint_segments = 64
checkpoint_timeout = 15min
effective_cache_size = 8GB
default_statistics_target = 500
-- Vlad
On 11/14/12 1:13 PM, Vlad wrote:
Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.
that's a really high client connection count for an 8-core system.
I'd consider implementing a connection pool (like pgbouncer), and
rewriting your client applications to connect, do a transaction,
disconnect, so the actual number of postgres connections is much lower,
say in the 16-48 range.
--
john r pierce N 37, W 122
santa cruz ca mid-left coast
John,
thanks for your feedback. While implementing connection pooling would make
resource utilization more efficient, I don't think it's the root of my
problem. Most of the connected clients are at IDLE. When I do
select * from pg_stat_activity where current_query not like '%IDLE%';
I only see several active queries at any given time.
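That check can also be polled from the shell during a stall window (a sketch; note that 9.1's pg_stat_activity exposes current_query, which 9.2 renames to query with a separate state column):

```shell
# count non-idle backends once a second for a minute
for i in $(seq 1 60); do
  psql -At -c "select now(), count(*) from pg_stat_activity
               where current_query not like '%IDLE%';"
  sleep 1
done
```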
-- Vlad
On 11/14/12 1:34 PM, Vlad wrote:
thanks for your feedback. While implementing connection pooling would
make resources utilization more efficient, I don't think it's the root
of my problem. Most of the connected clients are at IDLE. When I do
select * from pg_stat_activity where current_query not like '%IDLE%';
I only see several active queries at any given time.
what about during these spikes?
--
john r pierce N 37, W 122
santa cruz ca mid-left coast
On Wed, Nov 14, 2012 at 4:08 PM, John R Pierce <pierce@hogranch.com> wrote:
what about during these spikes?
Yeah -- if you are seeing a lot of queries pile up during these times,
it's time to explore a connection pooler, because it will keep system
load manageable during these situations.
things to check:
*) blocked queries (pg_locks/pg_stat_activity)
*) i/o wait; in particular, is the linux page cache flushing?
*) query stampede: we will want to measure TPS leading into the stall
and out of it.
*) anything else running on the box?
*) you have a large number of tables? maybe this is statistics-file related?
*) let's log checkpoints to see if there is correlation with the stall
*) nice to have vmstat/iostat during/before/after the stall. for example,
a massive spike of context switches during the stall could point to an
o/s scheduler issue.
merlin
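The blocked-queries item can be checked with a single catalog query; a sketch against the 9.1 catalogs (procpid and current_query are the 9.1 column names):

```shell
# show ungranted locks together with the query waiting on each
psql -x -c "
select l.pid, l.locktype, l.mode, a.current_query
from pg_locks l
join pg_stat_activity a on a.procpid = l.pid
where not l.granted;"
```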
there is no big spike of queries that causes it; queries come in at a
relatively stable pace. It's just that the higher the rate of incoming
queries, the more likely this is to happen. Yes, when a stall happens the
active queries pile up - but that's the result of the stall (the server
reacts slowly to a keypress, not to mention query execution), not the cause.
things to check:
*) blocked queries (pg_locks/pg_stat_activity)
nada
*) i/o wait. in particular, is linux page cache flushing.
no i/o wait, no IRQ (see below)
*) query stampede: we will want to measure TPS leading into the stall
and out of it.
*) anything else running on the box?
just bare linux + postgresql.
*) you have a large number of tables? maybe this is statistics-file related?
over 1000 tables across 4 or 5 schemas in a single database.
*) let's log checkpoints to see if there is correlation with stall
checked; checkpoints happen much more rarely and without relation to the
high-sys periods
*) nice to have vmstat/iostat during/before/after stall. for example,
massive spike of context switches during stall could point to o/s
scheduler issue.
checked that as well - nothing; the CS rate is even lower during the stall.
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16.94    0.00    9.28    0.38    0.00   73.40

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               6.00        48.00         0.00         48          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.06    0.00   18.43    0.25    0.00   63.26

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              10.00       104.00         0.00        104          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.12    0.00   86.74    0.12    0.00    4.01

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.97         7.77         0.00          8          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.44    0.00   96.58    0.00    0.00    1.98

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               3.28        78.69         0.00        144          0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff    cache   si   so   bi   bo     in    cs us sy id wa st
 1  0      0 279240  12016 14431964    0    0   32    0 197852  4299 15  9 76  0  0
 4  0      0 225984  12024 14419696    0    0    0   64 197711  5158 11  9 79  1  0
 0  0      0 260112  12024 14413636    0    0   48    0 196708  4618 17 10 73  0  0
 6  0      0 233936  12024 14375784    0    0  104    0 179861  4884 19 17 64  0  0
30  0      0 224904  12024 14354812    0    0    8    0  51088  1205  9 86  5  0  0
72  0      0 239144  12024 14333852    0    0  144    0  45601   542  2 98  0  0  0
78  0      0 224840  12024 14328536    0    0    0    0  38732   481  2 94  5  0  0
22  1      0 219072  12032 14250652    0    0  136  100  47323  1231  9 90  1  0  0
On Thu, Nov 15, 2012 at 11:50 AM, Vlad <marchenko@gmail.com> wrote:
hm. well, we can definitely rule out i/o. I reviewed your last
posting, and you said:
"Out of the top 50 processes in top, 48 of them are postmasters, one
is syslog, and one is psql. Each of the postmasters have a high %CPU,
the top ones being 80% and higher, the rest being anywhere between 30%
- 60%. Would postmaster 'queries' that are running attribute to the
sys CPU usage, or should they be under the 'us' CPU usage?"
Is this still true? Can we capture an strace from one of the high-%CPU
postmasters to see if there are any clues there? Maybe we've uncovered
some type of weird spinlock contention issue. How large is your
database (or at least the portion of it that's commonly used)? Would
you characterize your queries as mostly small lookups, scans, or a
mix?
merlin
Merlin,
this is not my report - it's probably from the thread that I referenced as
having similar symptoms. Here is the info about my db:
Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them idle.
The database has over 1000 tables (across 5 namespaces) and takes ~150GB on disk.
We have mostly select statements (joins across a few tables), using indexes
and resulting in a small number of records returned.
So, mostly small lookups across a few tables, joined by a primary key / index.
Also, I just performed an experiment - I switched our app over to using the
hot standby (PITR) instead of the master, and it experienced the same
problem. Since the standby can only perform read-only queries, there are no
write locks (except maybe when it is replaying wal records from the master);
nevertheless SYS CPU jumped sky-high.
-- Vlad
On Thu, Nov 15, 2012 at 2:20 PM, Vlad <marchenko@gmail.com> wrote:
yeah. ok, next steps:
*) can you confirm that postgres process is using high cpu (according
to top) during stall time
*) if so, please strace that process and save some of the log
*) you're using a 'bleeding edge' kernel. so we must be suspicious of
a regression there, particularly in the scheduler.
*) I am suspicious of a spinlock issue. so, if we can't isolate the
problem, is running a hand-compiled postgres a possibility (for lock
stats)?
*) what is the output of this:
echo /proc/sys/vm/zone_reclaim_mode
merlin
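The strace step would look something like this (the PID is hypothetical - pick whichever postmaster is hot in top; -tt adds microsecond timestamps and -T shows time spent inside each syscall):

```shell
# attach to a busy backend during the stall; Ctrl-C detaches cleanly
PID=12345    # illustrative: substitute the pid of a high-CPU postmaster
strace -tt -T -p "$PID" -o stall_postmaster.log
```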
yeah. ok, next steps:
*) can you confirm that postgres process is using high cpu (according
to top) during stall time
yes, CPU is spread across a lot of postmasters
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29863 pgsql 20 0 3636m 102m 36m R 19.1 0.3 0:01.33 postmaster
30277 pgsql 20 0 3645m 111m 37m R 16.8 0.3 0:01.27 postmaster
11966 pgsql 20 0 3568m 22m 15m R 15.1 0.1 0:00.66 postmaster
8073 pgsql 20 0 3602m 60m 26m S 13.6 0.2 0:00.77 postmaster
29780 pgsql 20 0 3646m 115m 43m R 13.6 0.4 0:01.13 postmaster
11865 pgsql 20 0 3606m 61m 23m S 12.8 0.2 0:01.87 postmaster
29379 pgsql 20 0 3603m 70m 30m R 12.8 0.2 0:00.80 postmaster
29727 pgsql 20 0 3616m 77m 31m R 12.5 0.2 0:00.81 postmaster
*) if so, please strace that process and save some of the log
https://dl.dropbox.com/u/109778/stall_postmaster.log
*) you're using a 'bleeding edge' kernel. so we must be suspicious of
a regression there, particularly in the scheduler.
this was observed for a while, during which period the system went from
3.4.* kernels to 3.5.*... but I do not deny such a possibility.
*) I am suspicious of a spinlock issue. so, if we can't isolate the
problem, is running a hand-compiled postgres a possibility (for lock
stats)?
Yes, definitely possible - we run manually compiled postgresql anyway.
Please provide instructions.
*) what is the output of this:
echo /proc/sys/vm/zone_reclaim_mode
I presume you wanted cat instead of echo, and it shows 0.
-- vlad
On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko@gmail.com> wrote:
ok, excellent. reviewing the log, this immediately caught my eye:
recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
64, 0, NULL, NULL) = 64
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
semop(41713721, {{2, 1, 0}}, 1) = 0
lseek(295, 0, SEEK_END) = 0
lseek(296, 0, SEEK_END) = 8192
this is definitely pointing to a spinlock issue. see: s_lock.c. what you
are seeing here is a backend getting caught in a spin loop via a stuck
spinlock. 0-9 = 10 times. did you by any chance check the logs? did
any backends restart? we are looking for this:
elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
     lock, file, line);
Anything else going on in the log? The way to enable lock statistics is
via the LWLOCK_STATS macro. But before doing anything, please check the
logs for errors/panics.
merlin
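Compiling the macro in amounts to rebuilding with the symbol defined; a sketch assuming a source install (prefix and version paths are illustrative):

```shell
# define LWLOCK_STATS at compile time and reinstall
cd postgresql-9.1.6
./configure --prefix=/usr/local/pgsql CFLAGS="-O2 -DLWLOCK_STATS"
make -j8 && make install
# restart the server; per-backend lwlock stats go to the server log
```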
On Thu, Nov 15, 2012 at 3:49 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
ah, scratch that. reading the code again, s_lock.c allows for up
to 1 second (that is, you can delay 1000 times). so the panic is
unlikely to be in the log. also, spinlock waits shouldn't increase by
exactly 1000 us. investigating...
merlin
Merlin Moncure wrote:
this is definitely pointing to a spinlock issue.
I met Rik van Riel (Linux kernel hacker) recently and we chatted about
this briefly. He strongly suggested that we should consider using
futexes on Linux instead of spinlocks; the big advantage being that
futexes sleep instead of spinning when contention is high. That would
reduce the system load in this scenario.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 15, 2012 at 4:29 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I met Rik van Riel (Linux kernel hacker) recently and we chatted about
this briefly. He strongly suggested that we should consider using
futexes on Linux instead of spinlocks; the big advantage being that
futexes sleep instead of spinning when contention is high. That would
reduce the system load in this scenario.
Well, so do postgres spinlocks, right? When we overflow
spins_per_delay we go to pg_usleep, which proxies to select() --
postgres spinlocks are a hybrid implementation. Moving to futexes is a
possible improvement (that's another discussion) in degenerate cases,
but I'm not sure that I've exactly zeroed in on the problem. Or am I
missing something?
What I've been scratching my head over is what code exactly would
cause an iterative sleep like the above. The code is here:
        pg_usleep(cur_delay * 1000L);

        /* increase delay by a random fraction between 1X and 2X */
        cur_delay += (int) (cur_delay *
                            ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);

        /* wrap back to minimum delay when max is exceeded */
        if (cur_delay > MAX_DELAY_MSEC)
            cur_delay = MIN_DELAY_MSEC;
...so cur_delay is supposed to increase in a non-linear fashion. I've
looked around the sleep, usleep, and latch calls and as of yet haven't
found anything that advances the timeout just like that (yet; need to do
another pass). And we don't know for sure whether this is directly related
to the OP's problem.
merlin
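For comparison, the worst case of that loop (random fraction always 1X, so the delay doubles every retry) produces this growth pattern - a toy sketch, not the PostgreSQL code itself:

```shell
# worst-case s_lock backoff: the select() timeout doubles each retry,
# starting from the 1 ms minimum delay
cur=1
for i in 1 2 3 4 5 6 7 8 9 10; do
  echo "select timeout: ${cur} ms"
  cur=$(( cur * 2 ))
done
# the observed strace deltas (1,1,1,2,3,4,...) grow far more slowly
# than this worst case
```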
Merlin Moncure <mmoncure@gmail.com> writes:
What I've been scratching my head over is what code exactly would
cause an iterative sleep like the above. The code is here:
        pg_usleep(cur_delay * 1000L);

        /* increase delay by a random fraction between 1X and 2X */
        cur_delay += (int) (cur_delay *
                            ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);

        /* wrap back to minimum delay when max is exceeded */
        if (cur_delay > MAX_DELAY_MSEC)
            cur_delay = MIN_DELAY_MSEC;
IIRC that coding is of relatively recent vintage. The OP is probably
running some older version that increased cur_delay differently.
regards, tom lane
Tom,
I just checked the version I'm running (9.1.6), and the code is quite
similar (src/backend/storage/lmgr/s_lock.c)
        pg_usleep(cur_delay * 1000L);

#if defined(S_LOCK_TEST)
        fprintf(stdout, "*");
        fflush(stdout);
#endif

        /* increase delay by a random fraction between 1X and 2X */
        cur_delay += (int) (cur_delay *
                            ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);

        /* wrap back to minimum delay when max is exceeded */
        if (cur_delay > MAX_DELAY_MSEC)
            cur_delay = MIN_DELAY_MSEC;
-- vlad
On Thu, Nov 15, 2012 at 2:44 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
This is not entirely inconsistent with the spinlock. Note that 1000
is repeated 3 times, and 5000 is missing.
This might just be a misleading random sample we got here. I've seen
similar close spacing in some simulations I've run.
It is not clear to me why we use a resolution of 1 msec here. If the
OS's implementation of select() eventually rounds to the nearest msec,
that is its business. But why do we have to lose intermediate
precision due to its decision?
Cheers,
Jeff
On Thu, Nov 15, 2012 at 6:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Yeah -- you're right, this is definitely a spinlock issue. Next steps:
*) in mostly-read workloads, we have a couple of known frequent
offenders, in particular the 'BufFreelistLock'. One way we can
influence that guy is to try significantly lowering/raising shared
buffers. So this is one thing to try.
*) failing that, the LWLOCK_STATS macro can be compiled in to give us some
information about the particular lock(s) we're blocking on. Hopefully
it's a lwlock -- this will make diagnosing the problem easier.
*) if we're not blocking on an lwlock, it's possibly a buffer-pin related
issue? I've seen this before, for example on an index scan that is
dependent on a seq scan. This long thread:
"http://postgresql.1045698.n5.nabble.com/9-2beta1-parallel-queries-ReleasePredicateLocks-CheckForSerializableConflictIn-in-the-oprofile-td5709812i100.html"
has a lot of information about that case and deserves a review.
*) we can consider experimenting with futex
(http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php)
to see if things improve. This is dangerous, and could crash your
server/eat your data, so fair warning.
merlin
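On 9.1 the shared_buffers experiment means editing postgresql.conf and restarting (ALTER SYSTEM does not exist until 9.4); a sketch with an illustrative data directory path:

```shell
# try a much larger (or smaller) buffer pool, then restart to apply it
PGDATA=/var/lib/pgsql/data    # illustrative path; adjust to your install
sed -i 's/^shared_buffers.*/shared_buffers = 13200MB/' "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" restart -m fast
```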
Merlin,
Yeah -- you're right, this is definitely spinlock issue. Next steps:
*) in mostly read workloads, we have a couple of known frequent
offenders. In particular the 'BufFreelistLock'. One way we can
influence that guy is to try and significantly lower/raise shared
buffers. So this is one thing to try.
the server has 32GB, with 3.2GB dedicated to shared buffers. I've increased
it to 13.2GB and the stall still happened. (I have a way of controlling the
number of queries hitting postgresql by shutting down our own memcache-based
app cache, so to test for a stall I temporarily shut down a few memcached
servers.)
*) failing that, LWLOCK_STATS macro can be compiled in to give us some
information about the particular lock(s) we're binding on. Hopefully
it's a lwlock -- this will make diagnosing the problem easier.
I've enabled that macro and am seeing lwlock messages flying by in the log
(see below), even when no high-sys-cpu stall is observed at the moment.
Should I be looking for something in particular?
PID 17293 lwlock 5906: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5932: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5934: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10854: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10856: shacq 4 exacq 0 blk 0
PID 17293 lwlock 10858: shacq 15 exacq 0 blk 0
PID 17293 lwlock 16442: shacq 4 exacq 0 blk 0
PID 17293 lwlock 16596: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20458: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20460: shacq 10 exacq 0 blk 0
PID 17293 lwlock 20464: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20466: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20480: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20482: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20484: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20508: shacq 12 exacq 0 blk 0
PID 17293 lwlock 20510: shacq 6 exacq 0 blk 0
PID 17293 lwlock 20938: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20940: shacq 22 exacq 0 blk 0
PID 17293 lwlock 20942: shacq 28 exacq 0 blk 0
PID 17293 lwlock 20944: shacq 56 exacq 0 blk 0
PID 17293 lwlock 20946: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20948: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20950: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20952: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20954: shacq 7 exacq 0 blk 0
PID 17293 lwlock 20956: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20958: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20960: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20962: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20964: shacq 2 exacq 0 blk 0
PID 17293 lwlock 46624: shacq 24 exacq 0 blk 0
PID 17293 lwlock 68126: shacq 6 exacq 0 blk 0
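To make output like this digestible, the blk column (the number of times a backend actually had to sleep on the lock) is the one to rank by; a toy sketch, with the sample counts invented for illustration - in practice, grep the lwlock lines out of the server log and pipe them into the awk instead of printf:

```shell
# aggregate blk counts per lock id and show the contended locks
printf '%s\n' \
  'PID 17293 lwlock 5906: shacq 1 exacq 0 blk 3' \
  'PID 17294 lwlock 5906: shacq 2 exacq 1 blk 2' \
  'PID 17293 lwlock 4: shacq 9 exacq 0 blk 0' |
awk '{ id=$4; sub(":","",id); blk[id]+=$10 }
     END { for (i in blk) if (blk[i]>0) print "lwlock", i, "blk", blk[i] }' |
sort -k4,4nr
```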
*) if we're not blocking on lwlock, it's possibly a buffer pin related
issue? I've seen this before, for example on an index scan that is
dependent on a seq scan. This long thread:
"http://postgresql.1045698.n5.nabble.com/9-2beta1-parallel-queries-ReleasePredicateLocks-CheckForSerializableConflictIn-in-the-oprofile-td5709812i100.html"
has a lot of information about that case and deserves a review.
*) we can consider experimenting with futex
(http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php)
to see if things improve. This is dangerous, and could crash your
server/eat your data, so fair warning.
merlin
-- vlad