[7.0.2] problems with spinlock under FreeBSD?

Started by The Hermit Hackerover 25 years ago12 messages
#1The Hermit Hacker
scrappy@hub.org

We had a crash this morning of our server ... not the machine, just the
postmaster processes ... all three of them spread across three seperate
ports ...

Just looking through logs now, I'm finding:

FATAL: s_lock(20048065) at spin.c:127, stuck spinlock. Aborting.

In one of them ...

I've got a load of core files all from about the same time, from the
various processes running on teh different ports:

714266 9936 -rw------- 1 pgsql pgsql 5074944 Aug 23 10:32 ./data/base/petpostings/postgres.core
944391 9984 -rw------- 1 pgsql pgsql 5099520 Aug 23 10:32 ./data/base/trends_acctng/postgres.core
1015889 9984 -rw------- 1 pgsql pgsql 5099520 Aug 23 10:33 ./data/base/pg_banners/postgres.core
1055605 10768 -rw------- 1 pgsql pgsql 5505024 Aug 23 10:34 ./data/base/rockwell/postgres.core
904800 10032 -rw------- 1 pgsql pgsql 5124096 Aug 23 10:32 ./data/base/area902/postgres.core
619085 74720 -rw------- 1 pgsql pgsql 38219776 Jun 7 21:09 ./data/base/thtphone/postgres.core
1944360 9936 -rw------- 1 pgsql pgsql 5074944 Aug 23 10:32 ./data/base/counter/postgres.core
896891 9808 -rw------- 1 pgsql pgsql 5009408 Aug 23 10:32 ./data/base/hub_traf_stats/postgres.core
1849088 20656 -rw------- 1 pgsql pgsql 10567680 Aug 23 09:56 ./data/base/horde/postgres.core
849311 7136 -rw------- 1 pgsql pgsql 3645440 Aug 23 12:36 ./special/mukesh/base/archies/postgres.core
857377 7104 -rw------- 1 pgsql pgsql 3629056 Aug 23 12:36 ./special/mukesh/base/water/postgres.core
8009 44176 -rw------- 1 pgsql pgsql 22589440 Aug 23 10:39 ./data2/udmsearch/postgres.core

if any of those help any?

looking in /var/log/messages, I'm seeing the following just before the
crashes:

Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
Aug 23 12:33:47 pgsql /kernel: file: table is full

would this be the cause? if so, I have to raise some limits, no problem
there, but just want to confirm ...

thanks ...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#1)
Re: [7.0.2] problems with spinlock under FreeBSD?

The Hermit Hacker <scrappy@hub.org> writes:

looking in /var/log/messages, I'm seeing the following just before the
crashes:

Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
Aug 23 12:33:47 pgsql /kernel: file: table is full

That sure looks like you'd better tweak your kernel settings ... but
offhand I don't see how it could lead to "stuck spinlock" errors.
What do you get from gdb backtraces on the corefiles?

regards, tom lane

#3Vince Vielhaber
vev@michvhf.com
In reply to: The Hermit Hacker (#1)
Re: [7.0.2] problems with spinlock under FreeBSD?

On Wed, 23 Aug 2000, The Hermit Hacker wrote:

We had a crash this morning of our server ... not the machine, just the
postmaster processes ... all three of them spread across three seperate
ports ...

Just looking through logs now, I'm finding:

[snip]

looking in /var/log/messages, I'm seeing the following just before the
crashes:

Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
Aug 23 12:33:47 pgsql /kernel: file: table is full

would this be the cause? if so, I have to raise some limits, no problem
there, but just want to confirm ...

What's maxusers set to in the kernel? If you want to try raising it on
the fly, try

# sysctl -w kern.maxfiles=abiggernumberthanitisnow

and set it to a bigger number than it is now.

# sysctl kern.maxfiles

will tell you what it's currently set to. Whether or not that has
anything to do with the spinlock problem, no idea.

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH email: vev@michvhf.com http://www.pop4.net
128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
Online Campground Directory http://www.camping-usa.com
Online Giftshop Superstore http://www.cloudninegifts.com
==========================================================================

#4The Hermit Hacker
scrappy@hub.org
In reply to: Vince Vielhaber (#3)
Re: [7.0.2] problems with spinlock under FreeBSD?

On Thu, 24 Aug 2000, Vince Vielhaber wrote:

On Wed, 23 Aug 2000, The Hermit Hacker wrote:

We had a crash this morning of our server ... not the machine, just the
postmaster processes ... all three of them spread across three seperate
ports ...

Just looking through logs now, I'm finding:

[snip]

looking in /var/log/messages, I'm seeing the following just before the
crashes:

Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
Aug 23 12:33:47 pgsql /kernel: file: table is full

would this be the cause? if so, I have to raise some limits, no problem
there, but just want to confirm ...

What's maxusers set to in the kernel? If you want to try raising it on

maxusers 128

the fly, try

# sysctl -w kern.maxfiles=abiggernumberthanitisnow

and set it to a bigger number than it is now.

# sysctl kern.maxfiles

just up'd her to 8192 from 4136 ... thanks, forgot about the sysctl value
for this, was dreading having to recompile :(

#5The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#2)
1 attachment(s)
Re: [7.0.2] problems with spinlock under FreeBSD?

Attached ...

On Thu, 24 Aug 2000, Tom Lane wrote:

The Hermit Hacker <scrappy@hub.org> writes:

looking in /var/log/messages, I'm seeing the following just before the
crashes:

Aug 23 12:33:47 pgsql syslogd: /dev/console: Too many open files in system: Too many open files in system
Aug 23 12:33:47 pgsql syslogd: /var/run/utmp: Too many open files in system
Aug 23 12:33:47 pgsql /kernel: file: table is full

That sure looks like you'd better tweak your kernel settings ... but
offhand I don't see how it could lead to "stuck spinlock" errors.
What do you get from gdb backtraces on the corefiles?

regards, tom lane

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

Attachments:

gdb.outtext/plain; charset=US-ASCII; name=gdb.outDownload
#6Vince Vielhaber
vev@michvhf.com
In reply to: The Hermit Hacker (#4)
Re: [7.0.2] problems with spinlock under FreeBSD?

On Thu, 24 Aug 2000, The Hermit Hacker wrote:

the fly, try

# sysctl -w kern.maxfiles=abiggernumberthanitisnow

and set it to a bigger number than it is now.

# sysctl kern.maxfiles

just up'd her to 8192 from 4136 ... thanks, forgot about the sysctl value
for this, was dreading having to recompile :(

I had to do it a couple of months ago on one of my machines. It was
almost at a year of uptime and I didn't want to reboot if I didn't
have to!

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH email: vev@michvhf.com http://www.pop4.net
128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
Online Campground Directory http://www.camping-usa.com
Online Giftshop Superstore http://www.cloudninegifts.com
==========================================================================

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#5)
Re: [7.0.2] problems with spinlock under FreeBSD?

The Hermit Hacker <scrappy@hub.org> writes:

What do you get from gdb backtraces on the corefiles?

#2 0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
#3 0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
#4 0x80f1580 in SpinAcquire (lockid=7) at spin.c:127
#5 0x80f3903 in LockRelease (lockmethod=1, locktag=0xbfbfe674, lockmode=1) at lock.c:1044
#6 0x80f2af9 in UnlockRelation (relation=0x82063f0, lockmode=1) at lmgr.c:178
#7 0x806f25e in index_endscan (scan=0x8208780) at indexam.c:284

That's interesting ... someone failing to release lock.c's master
spinlock, it looks like. Do you have anything in the postmaster log
from just before the crashes?

regards, tom lane

#8The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#7)
Re: [7.0.2] problems with spinlock under FreeBSD?

On Thu, 24 Aug 2000, Tom Lane wrote:

The Hermit Hacker <scrappy@hub.org> writes:

What do you get from gdb backtraces on the corefiles?

#2 0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
#3 0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
#4 0x80f1580 in SpinAcquire (lockid=7) at spin.c:127
#5 0x80f3903 in LockRelease (lockmethod=1, locktag=0xbfbfe674, lockmode=1) at lock.c:1044
#6 0x80f2af9 in UnlockRelation (relation=0x82063f0, lockmode=1) at lmgr.c:178
#7 0x806f25e in index_endscan (scan=0x8208780) at indexam.c:284

That's interesting ... someone failing to release lock.c's master
spinlock, it looks like. Do you have anything in the postmaster log
from just before the crashes?

okay, nothing that I can see that is 'unusual' in the log files, but as
shown below, at ~10:30am today, the same thing appears to have happened
...

%ls -lt */*.core
-rw------- 1 pgsql pgsql 22589440 Aug 23 10:39 udmsearch/postgres.core
-rw------- 1 pgsql pgsql 5505024 Aug 23 10:34 rockwell/postgres.core
-rw------- 1 pgsql pgsql 5099520 Aug 23 10:33 pg_banners/postgres.core
-rw------- 1 pgsql pgsql 5009408 Aug 23 10:32 hub_traf_stats/postgres.core
-rw------- 1 pgsql pgsql 5099520 Aug 23 10:32 trends_acctng/postgres.core
-rw------- 1 pgsql pgsql 5124096 Aug 23 10:32 area902/postgres.core
-rw------- 1 pgsql pgsql 5074944 Aug 23 10:32 petpostings/postgres.core
-rw------- 1 pgsql pgsql 5074944 Aug 23 10:32 counter/postgres.core
-rw------- 1 pgsql pgsql 10567680 Aug 23 09:56 horde/postgres.core

Check the gdb on a couple of them:

(gdb) where
#0 0x18271d90 in kill () from /usr/lib/libc.so.4
#1 0x182b2e09 in abort () from /usr/lib/libc.so.4
#2 0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
#3 0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
#4 0x80f1580 in SpinAcquire (lockid=7) at spin.c:127

(gdb) where
#0 0x18271d90 in kill () from /usr/lib/libc.so.4
#1 0x182b2e09 in abort () from /usr/lib/libc.so.4
#2 0x80ee847 in s_lock_stuck (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:51
#3 0x80ee8c3 in s_lock (lock=0x20048065 "\001", file=0x816723c "spin.c", line=127) at s_lock.c:80
#4 0x80f1580 in SpinAcquire (lockid=7) at spin.c:127

they all appear to be in the same place ...

now, I'm running 4 seperate postmaster daemons, with seperate data
directories, as:

ps ux | grep postmaster | grep 543
pgsql 50554 0.0 0.1 6904 556 p0- I 1:12PM 0:04.88 /pgsql/bin/postmaster -D/pgsql/special/sales.org -i -p 5434 (postgres)
pgsql 61821 0.0 0.1 7080 636 p6- S 4:38PM 3:03.86 /pgsql/bin/postmaster -B 256 -N 128 -o -F -o /pgsql/logs/5432.61820 -S 32768 -i -p 5432 -D/pgsql/data (postgres)
pgsql 62268 0.0 0.0 5488 0 p4- IW - 0:00.00 /pgsql/bin/postmaster -d 1 -N 16 -o -F -o /pgsql/logs/5433.62267 -S 32768 -i -p 5433 -D/pgsql/special/lee (postgres)
pgsql 27084 0.0 0.1 5496 596 p4- S 8:25AM 0:54.11 /pgsql/bin/postmaster -d 1 -N 16 -o -F -o /pgsql/logs/5437.27083 -S 32768 -i -p 5437 -D/pgsql/special/mukesh (postgres)

and the above core files are from the one running on 5432 ...

you still have your account on that machine if you want to take a quick
look around ... else, anything else I should be looking at?

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#8)
Re: [7.0.2] problems with spinlock under FreeBSD?

The Hermit Hacker <scrappy@hub.org> writes:

you still have your account on that machine if you want to take a quick
look around ... else, anything else I should be looking at?

I poked around and couldn't learn much of anything --- the logfiles from
yesterday are already gone, apparently. I did find some interesting
entries in today's logfiles:

%grep Lru *
5432.61820:FATAL 1: ReleaseLruFile: No open files available to be closed
postmaster.5437.36290:FATAL 1: ReleaseLruFile: No open files available to be closed
postmaster.5437.62218:FATAL 1: ReleaseLruFile: No open files available to be closed

What we see here are backends choking because there are no free kernel
file descriptor slots, even after they've closed *all* of their own
discretionary FDs. So you've definitely got a serious problem with
insufficient FD slots. Time to tweak those kernel parameters.

I still don't see a linkage between too few FDs and the stuck-spinlock
crashes, but maybe there is one.

regards, tom lane

#10The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#9)
Re: [7.0.2] problems with spinlock under FreeBSD?

okay, I just doubled my FDs to 8192 from 4136 and will watch things
... anyone know of a way of telling how many are currently in use, and
where they peaked? somethign similar to 'netstat -m' showing mbufs?

thanks tom ...

On Thu, 24 Aug 2000, Tom Lane wrote:

The Hermit Hacker <scrappy@hub.org> writes:

you still have your account on that machine if you want to take a quick
look around ... else, anything else I should be looking at?

I poked around and couldn't learn much of anything --- the logfiles from
yesterday are already gone, apparently. I did find some interesting
entries in today's logfiles:

%grep Lru *
5432.61820:FATAL 1: ReleaseLruFile: No open files available to be closed
postmaster.5437.36290:FATAL 1: ReleaseLruFile: No open files available to be closed
postmaster.5437.62218:FATAL 1: ReleaseLruFile: No open files available to be closed

What we see here are backends choking because there are no free kernel
file descriptor slots, even after they've closed *all* of their own
discretionary FDs. So you've definitely got a serious problem with
insufficient FD slots. Time to tweak those kernel parameters.

I still don't see a linkage between too few FDs and the stuck-spinlock
crashes, but maybe there is one.

regards, tom lane

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#11Vince Vielhaber
vev@michvhf.com
In reply to: The Hermit Hacker (#10)
Re: [7.0.2] problems with spinlock under FreeBSD?

On Thu, 24 Aug 2000, The Hermit Hacker wrote:

okay, I just doubled my FDs to 8192 from 4136 and will watch things
... anyone know of a way of telling how many are currently in use, and
where they peaked? somethign similar to 'netstat -m' showing mbufs?

Possibly pstat will do it. It'll give current but I don't know about
peak.

Vince.
--
==========================================================================
Vince Vielhaber -- KA8CSH email: vev@michvhf.com http://www.pop4.net
128K ISDN from $22.00/mo - 56K Dialup from $16.00/mo at Pop4 Networking
Online Campground Directory http://www.camping-usa.com
Online Giftshop Superstore http://www.cloudninegifts.com
==========================================================================

#12Oleg Bartunov
oleg@sai.msu.su
In reply to: The Hermit Hacker (#10)
Re: [7.0.2] problems with spinlock under FreeBSD?

On Thu, 24 Aug 2000, The Hermit Hacker wrote:

Date: Thu, 24 Aug 2000 16:26:07 -0300 (ADT)
From: The Hermit Hacker <scrappy@hub.org>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [7.0.2] problems with spinlock under FreeBSD?

okay, I just doubled my FDs to 8192 from 4136 and will watch things
... anyone know of a way of telling how many are currently in use, and
where they peaked? somethign similar to 'netstat -m' showing mbufs?

I have on rather busy site running FreeBSD
12:57:38[info]:/home/megera$ sysctl kern.maxfiles
kern.maxfiles: 16424

Did you try systat -vm ?

Regards,

Oleg

thanks tom ...

On Thu, 24 Aug 2000, Tom Lane wrote:

The Hermit Hacker <scrappy@hub.org> writes:

you still have your account on that machine if you want to take a quick
look around ... else, anything else I should be looking at?

I poked around and couldn't learn much of anything --- the logfiles from
yesterday are already gone, apparently. I did find some interesting
entries in today's logfiles:

%grep Lru *
5432.61820:FATAL 1: ReleaseLruFile: No open files available to be closed
postmaster.5437.36290:FATAL 1: ReleaseLruFile: No open files available to be closed
postmaster.5437.62218:FATAL 1: ReleaseLruFile: No open files available to be closed

What we see here are backends choking because there are no free kernel
file descriptor slots, even after they've closed *all* of their own
discretionary FDs. So you've definitely got a serious problem with
insufficient FD slots. Time to tweak those kernel parameters.

I still don't see a linkage between too few FDs and the stuck-spinlock
crashes, but maybe there is one.

regards, tom lane

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83