"stuck spinlock"
Greetings,
Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form:
/var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12 21:40:10.328 UTC,"n","n",32376,"10.2.1.142:52451",52aa24eb.7e78,5,"SELECT",2013-12-12 21:04:43 UTC,9/7178,0,PANIC,XX000,"stuck spinlock (0x7f7df94672f4) detected at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099",,,,,,"<redacted>"
uname -a: Linux postgresql3-master 3.8.0-33-generic #48~precise1-Ubuntu SMP Thu Oct 24 16:28:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux.
Generally, there's no core file (which is currently enable), as the postmaster just normally exits the backend.
Diagnosis suggestions?
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Christophe Pettus <xof@thebuild.com> writes:
Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form:
/var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12 21:40:10.328 UTC,"n","n",32376,"10.2.1.142:52451",52aa24eb.7e78,5,"SELECT",2013-12-12 21:04:43 UTC,9/7178,0,PANIC,XX000,"stuck spinlock (0x7f7df94672f4) detected at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099",,,,,,"<redacted>"
uname -a: Linux postgresql3-master 3.8.0-33-generic #48~precise1-Ubuntu SMP Thu Oct 24 16:28:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux.
Generally, there's no core file (which is currently enable), as the postmaster just normally exits the backend.
Hm, a PANIC really ought to result in a core file. You sure you don't
have that disabled (perhaps via a ulimit setting)?
As for the root cause, it's hard to say. The file/line number says it's
a buffer header lock that's stuck. I rechecked all the places that lock
buffer headers, and all of them have very short code paths to the
corresponding unlock, so there's no obvious explanation how this could
happen.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2013-12-12 13:50:06 -0800, Christophe Pettus wrote:
Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form:
/var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12 21:40:10.328 UTC,"n","n",32376,"10.2.1.142:52451",52aa24eb.7e78,5,"SELECT",2013-12-12 21:04:43 UTC,9/7178,0,PANIC,XX000,"stuck spinlock (0x7f7df94672f4) detected at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099",,,,,,"<redacted>"
Any other changes but the upgrade? Maybe a different compiler version?
Also, could you share some details about the workload? Highly
concurrent? Standby? ...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Dec 12, 2013 at 3:33 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Any other changes but the upgrade? Maybe a different compiler version?
Show pg_config output.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Dec 12, 2013, at 3:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
Show pg_config output.
Below; it's the Ubuntu package.
BINDIR = /usr/lib/postgresql/9.3/bin
DOCDIR = /usr/share/doc/postgresql-doc-9.3
HTMLDIR = /usr/share/doc/postgresql-doc-9.3
INCLUDEDIR = /usr/include/postgresql
PKGINCLUDEDIR = /usr/include/postgresql
INCLUDEDIR-SERVER = /usr/include/postgresql/9.3/server
LIBDIR = /usr/lib
PKGLIBDIR = /usr/lib/postgresql/9.3/lib
LOCALEDIR = /usr/share/locale
MANDIR = /usr/share/postgresql/9.3/man
SHAREDIR = /usr/share/postgresql/9.3
SYSCONFDIR = /etc/postgresql-common
PGXS = /usr/lib/postgresql/9.3/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' '--with-tclconfig=/usr/lib/tcl8.5' '--with-tkconfig=/usr/lib/tk8.5' '--with-includes=/usr/include/tcl8.5' 'PYTHON=/usr/bin/python' '--mandir=/usr/share/postgresql/9.3/man' '--docdir=/usr/share/doc/postgresql-doc-9.3' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/9.3' '--bindir=/usr/lib/postgresql/9.3/bin' '--libdir=/usr/lib/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety' '--enable-debug' '--disable-rpath' '--with-ossp-uuid' '--with-gnu-ld' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' 'CFLAGS=-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -fPIC -pie -I/usr/include/mit-krb5 -DLINUX_OOM_ADJ=0' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5' '--with-krb5' '--with-gssapi' '--with-ldap' 'CPPFLAGS=-D_FORTIFY_SOURCE=2'
CC = gcc
CPPFLAGS = -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/tcl8.5
CFLAGS = -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -fPIC -pie -I/usr/include/mit-krb5 -DLINUX_OOM_ADJ=0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g
CFLAGS_SL = -fpic
LDFLAGS = -L../../../src/common -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5 -L/usr/lib/x86_64-linux-gnu -Wl,--as-needed
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgport -lpgcommon -lxslt -lxml2 -lpam -lssl -lcrypto -lkrb5 -lcom_err -lgssapi_krb5 -lz -ledit -lcrypt -ldl -lm
VERSION = PostgreSQL 9.3.2
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Dec 12, 2013, at 3:33 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Any other changes but the upgrade? Maybe a different compiler version?
Just the upgrade; they're using the Ubuntu packages from apt.postgresql.org.
Also, could you share some details about the workload? Highly
concurrent? Standby? ...
The workload is not very highly concurrent; actually quite lightly loaded. There are a very large number (442,000) of user tables. No standby attached.
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Dec 12, 2013, at 3:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hm, a PANIC really ought to result in a core file. You sure you don't
have that disabled (perhaps via a ulimit setting)?
Since it's using the Ubuntu packaging, we have pg_ctl_options = '-c' in /etc/postgresql/9.3/main/pg_ctl.conf.
As for the root cause, it's hard to say. The file/line number says it's
a buffer header lock that's stuck. I rechecked all the places that lock
buffer headers, and all of them have very short code paths to the
corresponding unlock, so there's no obvious explanation how this could
happen.
The server was running with shared_buffers=100GB, but the problem has reoccurred now with shared_buffers=16GB.
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Christophe Pettus <xof@thebuild.com> writes:
On Dec 12, 2013, at 3:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hm, a PANIC really ought to result in a core file. You sure you don't
have that disabled (perhaps via a ulimit setting)?
Since it's using the Ubuntu packaging, we have pg_ctl_options = '-c' in /etc/postgresql/9.3/main/pg_ctl.conf.
[ shrug... ] If you aren't getting a core file for a PANIC, then core
files are disabled. I take no position on the value of the setting
you mention above, but I will note that pg_ctl can't override a hard
"ulimit -c 0" system-wide setting.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Dec 12, 2013, at 4:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If you aren't getting a core file for a PANIC, then core
files are disabled.
And just like that, we get one. Stack trace:
#0 0x00007f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f699c81991b in errfinish ()
#3 0x00007f699c81a477 in elog_finish ()
#4 0x00007f699c735db3 in s_lock ()
#5 0x00007f699c71e1f0 in ?? ()
#6 0x00007f699c71eaf9 in ?? ()
#7 0x00007f699c71f53e in ReadBufferExtended ()
#8 0x00007f699c56d03a in index_fetch_heap ()
#9 0x00007f699c67a0b7 in ?? ()
#10 0x00007f699c66e98e in ExecScan ()
#11 0x00007f699c6679a8 in ExecProcNode ()
#12 0x00007f699c67407f in ExecAgg ()
#13 0x00007f699c6678b8 in ExecProcNode ()
#14 0x00007f699c664dd2 in standard_ExecutorRun ()
#15 0x00007f6996ad928d in ?? ()
from /usr/lib/postgresql/9.3/lib/auto_explain.so
#16 0x00007f69968d3525 in ?? ()
from /usr/lib/postgresql/9.3/lib/pg_stat_statements.so
#17 0x00007f699c745207 in ?? ()
#18 0x00007f699c746651 in PortalRun ()
#19 0x00007f699c742960 in PostgresMain ()
#20 0x00007f699c6ff765 in PostmasterMain ()
#21 0x00007f699c53bea2 in main ()
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2013-12-12 16:22:28 -0800, Christophe Pettus wrote:
On Dec 12, 2013, at 4:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If you aren't getting a core file for a PANIC, then core
files are disabled.And just like that, we get one. Stack trace:
#0 0x00007f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f699c81991b in errfinish ()
#3 0x00007f699c81a477 in elog_finish ()
#4 0x00007f699c735db3 in s_lock ()
#5 0x00007f699c71e1f0 in ?? ()
#6 0x00007f699c71eaf9 in ?? ()
Could you install the -dbg package and regenerate?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Dec 12, 2013, at 4:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Could you install the -dbg package and regenerate?
Of course!
#0 0x00007f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f699c81991b in errfinish (dummy=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:542
#3 0x00007f699c81a477 in elog_finish (elevel=<optimized out>,
fmt=0x7f699c937a48 "stuck spinlock (%p) detected at %s:%d")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:1297
#4 0x00007f699c735db3 in s_lock_stuck (line=1099,
file=0x7f699c934a78 "/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c", lock=0x7f6585e2cbb4 "\001")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:40
#5 s_lock (lock=0x7f6585e2cbb4 "\001",
file=0x7f699c934a78 "/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c", line=1099)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:109
#6 0x00007f699c71e1f0 in PinBuffer (buf=0x7f6585e2cb94, strategy=0x0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099
#7 0x00007f699c71eaf9 in BufferAlloc (foundPtr=0x7fff60ec563e "",
strategy=0x0, blockNum=1730, forkNum=MAIN_FORKNUM, relpersistence=112 'p',
smgr=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:776
#8 ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p',
forkNum=MAIN_FORKNUM, blockNum=1730, mode=RBM_NORMAL, strategy=0x0,
hit=0x7fff60ec56af "")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:333
#9 0x00007f699c71f53e in ReadBufferExtended (reln=0x7f6577d80560,
forkNum=MAIN_FORKNUM, blockNum=1730, mode=<optimized out>,
strategy=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:252
#10 0x00007f699c56d03a in index_fetch_heap (scan=0x7f699f94c7a0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/index/indexam.c:515
#11 0x00007f699c67a0b7 in IndexOnlyNext (node=0x7f699f94b690)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeIndexonlyscan.c:109
#12 0x00007f699c66e98e in ExecScanFetch (
recheckMtd=0x7f699c679fb0 <IndexOnlyRecheck>,
accessMtd=0x7f699c679fe0 <IndexOnlyNext>, node=0x7f699f94b690)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c---Type <return> to continue, or q <return> to quit---
:82
#13 ExecScan (node=0x7f699f94b690, accessMtd=0x7f699c679fe0 <IndexOnlyNext>,
recheckMtd=0x7f699c679fb0 <IndexOnlyRecheck>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:167
#14 0x00007f699c6679a8 in ExecProcNode (node=0x7f699f94b690)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:408
#15 0x00007f699c67407f in agg_retrieve_direct (aggstate=0x7f699f94af90)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1121
#16 ExecAgg (node=0x7f699f94af90)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1013
#17 0x00007f699c6678b8 in ExecProcNode (node=0x7f699f94af90)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:476
#18 0x00007f699c664dd2 in ExecutePlan (dest=0x7f699f98c308,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, planstate=0x7f699f94af90, estate=0x7f699f94ae80)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:1472
#19 standard_ExecutorRun (queryDesc=0x7f699f940dc0, direction=<optimized out>,
count=0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:307
#20 0x00007f6996ad928d in explain_ExecutorRun (queryDesc=0x7f699f940dc0,
direction=ForwardScanDirection, count=0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/auto_explain/auto_explain.c:233
#21 0x00007f69968d3525 in pgss_ExecutorRun (queryDesc=0x7f699f940dc0,
direction=ForwardScanDirection, count=0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/pg_stat_statements/pg_stat_statements.c:717
#22 0x00007f699c745207 in PortalRunSelect (portal=0x7f699de596a0,
forward=<optimized out>, count=0, dest=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:946
#23 0x00007f699c746651 in PortalRun (portal=0x7f699de596a0,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f699f98c308,
altdest=0x7f699f98c308, completionTag=0x7fff60ec5f30 "")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:790
#24 0x00007f699c742960 in exec_simple_query (
query_string=0x7f699dd564a0 "SELECT COUNT(*) FROM \"signups\" WHERE (signups.is_supporter = true)")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/postgres.c:1048
#25 PostgresMain (argc=<optimized out>, argv=<optimized out>,
---Type <return> to continue, or q <return> to quit---
dbname=0x7f699dd3e138 "nbuild", username=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/postgres.c:4005
#26 0x00007f699c6ff765 in BackendRun (port=0x7f699dd82050)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/postmaster/postmaster.c:3999
#27 BackendStartup (port=0x7f699dd82050)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/postmaster/postmaster.c:3688
#28 ServerLoop ()
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/postmaster/postmaster.c:1589
#29 PostmasterMain (argc=<optimized out>, argv=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/postmaster/postmaster.c:1258
#30 0x00007f699c53bea2 in main (argc=5, argv=0x7f699dd3c1a0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/main/main.c:196
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Dec 12, 2013, at 4:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Could you install the -dbg package and regenerate?
Here's another, same system, different crash:
#0 0x00007fa03faf5425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fa03faf8b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fa041e1491b in errfinish (dummy=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:542
#3 0x00007fa041e15477 in elog_finish (elevel=<optimized out>,
fmt=0x7fa041f32a48 "stuck spinlock (%p) detected at %s:%d")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:1297
#4 0x00007fa041d30db3 in s_lock_stuck (line=1099,
file=0x7fa041f2fa78 "/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c", lock=0x7f9c2acb2ac8 "\001")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:40
#5 s_lock (lock=0x7f9c2acb2ac8 "\001",
file=0x7fa041f2fa78 "/tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c", line=1099)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:109
#6 0x00007fa041d191f0 in PinBuffer (buf=0x7f9c2acb2aa8, strategy=0x0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099
#7 0x00007fa041d19af9 in BufferAlloc (foundPtr=0x7fff1948963e "\001",
strategy=0x0, blockNum=8796, forkNum=MAIN_FORKNUM, relpersistence=112 'p',
smgr=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:776
#8 ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p',
forkNum=MAIN_FORKNUM, blockNum=8796, mode=RBM_NORMAL, strategy=0x0,
hit=0x7fff194896af "")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:333
#9 0x00007fa041d1a53e in ReadBufferExtended (reln=0x7f9c1edd4908,
forkNum=MAIN_FORKNUM, blockNum=8796, mode=<optimized out>,
strategy=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:252
#10 0x00007fa041b5a706 in heapgetpage (scan=0x7fa043389050, page=8796)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:332
#11 0x00007fa041b5ac12 in heapgettup_pagemode (scan=0x7fa043389050,
dir=<optimized out>, nkeys=0, key=0x0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:939
#12 0x00007fa041b5bf76 in heap_getnext (scan=0x7fa043389050,
direction=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:1459
#13 0x00007fa041c7a9eb in SeqNext (node=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeSeqscan.c:66
#14 0x00007fa041c6998e in ExecScanFetch (
recheckMtd=0x7fa041c7a9b0 <SeqRecheck>,
accessMtd=0x7fa041c7a9c0 <SeqNext>, node=0x7fa0440f1c10)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:82
#15 ExecScan (node=0x7fa0440f1c10, accessMtd=0x7fa041c7a9c0 <SeqNext>,
recheckMtd=0x7fa041c7a9b0 <SeqRecheck>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:167
#16 0x00007fa041c629c8 in ExecProcNode (node=0x7fa0440f1c10)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:400
#17 0x00007fa041c6f07f in agg_retrieve_direct (aggstate=0x7fa0440f1510)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1121
#18 ExecAgg (node=0x7fa0440f1510)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1013
#19 0x00007fa041c628b8 in ExecProcNode (node=0x7fa0440f1510)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:476
#20 0x00007fa041c5fdd2 in ExecutePlan (dest=0x7fa042a955e0,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, planstate=0x7fa0440f1510, estate=0x7fa0440f1400)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:1472
#21 standard_ExecutorRun (queryDesc=0x7fa0440f0ff0, direction=<optimized out>,
count=0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:307
#22 0x00007fa03c0d428d in explain_ExecutorRun (queryDesc=0x7fa0440f0ff0,
direction=ForwardScanDirection, count=0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/auto_explain/auto_explain.c:233
#23 0x00007fa03bece525 in pgss_ExecutorRun (queryDesc=0x7fa0440f0ff0,
direction=ForwardScanDirection, count=0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/pg_stat_statements/pg_stat_statements.c:717
#24 0x00007fa041d40207 in PortalRunSelect (portal=0x7fa0427061f0,
forward=<optimized out>, count=0, dest=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:946
#25 0x00007fa041d41651 in PortalRun (portal=0x7fa0427061f0,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x7fa042a955e0,
altdest=0x7fa042a955e0, completionTag=0x7fff19489fa0 "")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:790
#26 0x00007fa041d3d960 in exec_simple_query (
query_string=0x7fa0426034a0 "SELECT COUNT(*) FROM \"signups\" WHERE (signups.active_customer_expires_at > '2013-11-11' and (signups.overdue_invoices_count is null or signups.overdue_invoices_count = 0))")
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/postgres.c:1048
#27 PostgresMain (argc=<optimized out>, argv=<optimized out>,
dbname=0x7fa0425eb138 "nbuild", username=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/postgres.c:4005
#28 0x00007fa041cfa765 in BackendRun (port=0x7fa04262f050)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/postmaster/postmaster.c:3999
#29 BackendStartup (port=0x7fa04262f050)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/postmaster/postmaster.c:3688
#30 ServerLoop ()
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/postmaster/postmaster.c:1589
#31 PostmasterMain (argc=<optimized out>, argv=<optimized out>)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/postmaster/postmaster.c:1258
#32 0x00007fa041b36ea2 in main (argc=5, argv=0x7fa0425e91a0)
at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/main/main.c:196
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Christophe Pettus <xof@thebuild.com> writes:
On Dec 12, 2013, at 4:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Could you install the -dbg package and regenerate?
Here's another, same system, different crash:
Both of these look like absolutely run-of-the-mill buffer access attempts.
Presumably, we are seeing the victim rather than the perpetrator of
whatever is going wrong. Whoever is holding the spinlock is just
going down with the rest of the system ...
In a devel environment, I'd try using the postmaster's -T switch so that
it SIGSTOP's all the backends instead of SIGQUIT'ing them, and then I'd
run around and gdb all the other backends to try to see which one was
holding the spinlock and why. Unfortunately, that's probably not
practical in a production environment; it'd take too long to collect
the stack traces by hand. So I have no good ideas about how to debug
this, unless you can reproduce it on a devel box, or are willing to
run modified executables in production.
Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
most systems dump core files with process IDs embedded in the names.
What would be more useful today is an option to send SIGABRT, or some
other signal that would force core dumps. Thoughts?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Dec 12, 2013, at 5:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Presumably, we are seeing the victim rather than the perpetrator of
whatever is going wrong.
This is probing about a bit blindly, but the only thing I can see about this system that is in some way unique (and this is happening on multiple machines, so it's unlikely to be hardware) is that there are a relatively large number of relations (like, 440,000+) distributed over many schemas. Is there anything that pins a buffer that is O(N) to the number of relations?
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Christophe Pettus <xof@thebuild.com> writes:
On Dec 12, 2013, at 5:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Presumably, we are seeing the victim rather than the perpetrator of
whatever is going wrong.
This is probing about a bit blindly, but the only thing I can see about this system that is in some way unique (and this is happening on multiple machines, so it's unlikely to be hardware) is that there are a relatively large number of relations (like, 440,000+) distributed over many schemas. Is there anything that pins a buffer that is O(N) to the number of relations?
It's not a buffer *pin* that's at issue, it's a buffer header spinlock.
And there are no loops, of any sort, that are executed while holding
such a spinlock. At least not in the core PG code. Are you possibly
using any nonstandard extensions?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Dec 12, 2013, at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Are you possibly using any nonstandard extensions?
No, totally stock PostgreSQL.
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On 2013-12-12 13:50:06 -0800, Christophe Pettus wrote:
Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form:
Is it really a regular pattern like hourly? What's your
checkpoint_segments?
Could you, arround the time of a crash, check "grep Dirt
/proc/meminfo" and run iostat -xm 1 20?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Dec 12, 2013 at 5:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
most systems dump core files with process IDs embedded in the names.
What would be more useful today is an option to send SIGABRT, or some
other signal that would force core dumps. Thoughts?
I think it would be possible, at least on Linux, to have GDB connect
to the postmaster, and then automatically create new inferiors as new
backends are forked, and then have every inferior paused as
breakpoints are hit. See:
http://sourceware.org/gdb/onlinedocs/gdb/Forks.html
and
http://sourceware.org/gdb/onlinedocs/gdb/All_002dStop-Mode.html
(I think the word 'thread' is just a shorthand for 'inferior' in the
"stops mode" doc page, and you can definitely debug Postgres processes
in multiple inferiors today).
Now, I'm not sure how feasible this is in a production debugging
situation. It seems like an interesting way of debugging these sorts
of issues that should be explored and perhaps subsequently codified.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2013-12-12 21:15:29 -0500, Tom Lane wrote:
Christophe Pettus <xof@thebuild.com> writes:
On Dec 12, 2013, at 5:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Presumably, we are seeing the victim rather than the perpetrator of
whatever is going wrong.This is probing about a bit blindly, but the only thing I can see about this system that is in some way unique (and this is happening on multiple machines, so it's unlikely to be hardware) is that there are a relatively large number of relations (like, 440,000+) distributed over many schemas. Is there anything that pins a buffer that is O(N) to the number of relations?
It's not a buffer *pin* that's at issue, it's a buffer header spinlock.
And there are no loops, of any sort, that are executed while holding
such a spinlock. At least not in the core PG code. Are you possibly
using any nonstandard extensions?
It could maybe be explained by a buffer aborting while performing
IO. Until it has call AbortBufferIO(), other backends will happily loop
in WaitIO(), constantly taking the the buffer header spinlock and
locking io_in_progress_lock in shared mode, thereby preventing
AbortBufferIO() from succeeding.
Christophe: are there any "unusual" ERROR messages preceding the crash,
possibly some minutes before?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Dec 12, 2013, at 6:24 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Is it really a regular pattern like hourly? What's your
checkpoint_segments?
No, it's not a pattern like that; that's an approximation. Sometimes, they come in clusters, sometimes, 2-3 hours past without one. They don't happen exclusively inside or outside of a checkpoint.
checkpoint_timeout = 5min
checkpoint_segments = 64
checkpoint_completion_target = 0.9
Could you, arround the time of a crash, check "grep Dirt
/proc/meminfo" and run iostat -xm 1 20?
Dirty: 30104 kB
avg-cpu: %user %nice %system %iowait %steal %idle
3.70 0.00 0.91 0.53 0.00 94.85
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.83 113.13 1.18 2.01 0.06 0.45 329.29 0.04 12.18 1.28 18.55 0.57 0.18
sdb 0.06 113.15 0.98 1.99 0.06 0.45 349.36 0.24 79.30 3.57 116.60 1.46 0.43
md0 0.00 0.00 0.00 0.00 0.00 0.00 3.39 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 1.18 114.92 0.01 0.45 8.01 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.06 111.82 0.00 0.44 8.02 0.57 4.88 0.24 4.89 0.04 0.43
dm-1 0.00 0.00 1.11 3.03 0.00 0.01 8.00 1.25 300.47 0.38 410.89 0.17 0.07
sdc 0.00 0.00 12.10 136.13 0.50 19.97 282.85 1.94 13.07 2.30 14.03 0.55 8.20
dm-2 0.00 39.63 24.23 272.24 1.00 39.82 281.97 1.31 4.44 1.98 4.65 0.44 13.03
sdd 0.00 0.00 12.13 136.11 0.50 19.84 281.10 1.35 9.10 1.64 9.77 0.42 6.21
avg-cpu: %user %nice %system %iowait %steal %idle
1.09 0.00 0.08 0.13 0.00 98.71
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 558.00 0.00 8.95 32.85 7.36 13.20 0.00 13.20 0.12 6.80
dm-2 0.00 28.00 0.00 558.00 0.00 8.95 32.85 7.38 13.23 0.00 13.23 0.12 6.80
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.38 0.00 0.17 0.13 0.00 99.33
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 36.00 11.00 0.18 0.15 14.30 0.06 1.36 0.67 3.64 0.94 4.40
dm-2 0.00 0.00 36.00 11.00 0.18 0.15 14.30 0.06 1.36 0.67 3.64 0.94 4.40
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.83 0.00 0.29 0.04 0.00 98.83
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 33.00 0.00 0.32 0.00 19.88 0.12 3.52 3.52 0.00 0.61 2.00
dm-2 0.00 0.00 33.00 0.00 0.32 0.00 19.88 0.12 3.52 3.52 0.00 0.61 2.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.08 0.00 0.29 0.00 0.00 98.62
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 6.00 4.00 0.07 0.03 20.80 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 6.00 4.00 0.07 0.03 20.80 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 0.08 0.04 0.00 99.37
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 11.00 3.00 0.22 0.12 49.14 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 11.00 3.00 0.22 0.12 49.14 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers