Segfaults with 8.1.3 on amd64

Thread started by Gavin Hamill (4 messages, general)
#1 Gavin Hamill
gdh@laterooms.com

Hi, our 8.1.3 system on quad Xeon has been happily chugging away for
weeks with no stability problems until yesterday:

/var/log/syslog:May 4 11:57:17 cayenne kernel: postmaster[19291]: segfault at 0000000000000000 rip 00002aaaab5e8c00 rsp 00007fffffffd418 error 4
/var/log/syslog.0:May 3 09:39:06 cayenne kernel: postmaster[32698]: segfault at 0000000000000000 rip 00002aaaab5e8c00 rsp 00007fffffffd418 error 4
/var/log/syslog.0:May 3 11:02:00 cayenne kernel: postmaster[12427]: segfault at 0000000000000000 rip 00002aaaab5e8c00 rsp 00007fffffffd418 error 4

I don't know what the rip + rsp values represent, but is it significant
that they are identical in all three cases?

Not a single OS change has occurred on the machine; in fact, the only
thing happening other than pg itself is me tailing the logs.

I'm using Debian sarge with the 8.1.3 debs from backports.org which I
trust; I doubt running postmaster under gdb will be workable due to the
performance penalty.

The pg logs don't show much of interest:

2006-05-04 11:57:17 BST LOG: server process (PID 19291) was terminated by signal 11
2006-05-04 11:57:17 BST LOG: terminating any other active server processes
2006-05-04 11:57:17 BST WARNING: terminating connection because of crash of another server process
2006-05-04 11:57:17 BST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2006-05-04 11:57:17 BST HINT: In a moment you should be able to reconnect to the database and repeat your command.

[loads of these]

2006-05-04 11:57:18 BST FATAL: the database system is in recovery mode
2006-05-04 11:57:18 BST FATAL: the database system is in recovery mode
2006-05-04 11:57:18 BST FATAL: the database system is in recovery mode
2006-05-04 11:57:18 BST FATAL: the database system is in recovery mode
2006-05-04 11:57:18 BST LOG: all server processes terminated; reinitializing
2006-05-04 11:57:18 BST FATAL: the database system is starting up
2006-05-04 11:57:18 BST FATAL: the database system is starting up
2006-05-04 11:57:18 BST FATAL: the database system is starting up
2006-05-04 11:57:18 BST FATAL: the database system is starting up
2006-05-04 11:57:18 BST FATAL: the database system is starting up
2006-05-04 11:57:18 BST LOG: database system was interrupted at 2006-05-04 11:56:17 BST
2006-05-04 11:57:18 BST LOG: checkpoint record is at 68/A9D2F2E8
2006-05-04 11:57:18 BST LOG: redo record is at 68/A9D17DD0; undo record is at 0/0; shutdown FALSE
2006-05-04 11:57:18 BST LOG: next transaction ID: 728532363; next OID: 183302937
2006-05-04 11:57:18 BST LOG: next MultiXactId: 46957; next MultiXactOffset: 98539
2006-05-04 11:57:18 BST LOG: database system was not properly shut down; automatic recovery in progress
2006-05-04 11:57:18 BST LOG: redo starts at 68/A9D17DD0
2006-05-04 11:57:18 BST FATAL: the database system is starting up

[ loads of these]

2006-05-04 11:57:19 BST LOG: record with zero length at 68/ABAF4F48
2006-05-04 11:57:19 BST LOG: redo done at 68/ABAF4F18
2006-05-04 11:57:19 BST LOG: could not truncate directory "pg_multixact/members": apparent wraparound
2006-05-04 11:57:19 BST LOG: database system is ready
2006-05-04 11:57:19 BST LOG: transaction ID wrap limit is 1362094701, limited by database "postgres"

Encouragingly, pg_config shows that --enable-debug was passed as a
./configure argument:

CONFIGURE = '--build=x86_64-linux' '--prefix=/usr'
'--includedir=/usr/include' '--mandir=/usr/share/man'
'--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var'
'--libexecdir=/usr/lib/postgresql-8.1' '--srcdir=.'
'--disable-maintainer-mode' '--mandir=/usr/share/postgresql/8.1/man'
'--with-docdir=/usr/share/doc/postgresql-doc-8.1'
'--datadir=/usr/share/postgresql/8.1'
'--bindir=/usr/lib/postgresql/8.1/bin'
'--includedir=/usr/include/postgresql/' '--enable-nls'
'--enable-integer-datetimes' '--enable-debug' '--disable-rpath'
'--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-krb5'
'--with-openssl' '--with-gnu-ld' '--with-tclconfig=/usr/lib/tcl8.4'
'--with-tkconfig=/usr/lib/tk8.4' '--with-includes=/usr/include/tcl8.4'
'--with-pgport=5432' '--enable-thread-safety' 'CC=cc' 'CFLAGS=-g -Wall
-O2 -Wl,--as-needed' 'build_alias=x86_64-linux'
CC = cc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/tcl8.4
CFLAGS = -g -Wall -O2 -Wl,--as-needed -Wall -Wmissing-prototypes
-Wpointer-arith -Winline -Wendif-labels -fno-strict-aliasing -g
CFLAGS_SL = -fpic
LDFLAGS =
LDFLAGS_SL =
LIBS = -lpgport -lpam -lssl -lcrypto -lkrb5 -lz -lreadline -lcrypt
-lresolv -lnsl -ldl -lm
VERSION = PostgreSQL 8.1.3
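The same check can be scripted. What follows is a minimal Python sketch, assuming the input is the text printed by `pg_config --configure` (the CONFIGURE value above, unwrapped). The helper names `configure_args`, `has_flag` and `cflags` are hypothetical. Note that --enable-debug only means the build used `-g`; whether the installed binaries still carry those symbols is a separate question, since packaged binaries are commonly stripped after the build.

```python
import shlex

def configure_args(configure_output: str) -> list[str]:
    # pg_config prints each ./configure argument as a single-quoted shell word,
    # so shlex.split recovers the individual arguments (including ones with spaces).
    return shlex.split(configure_output)

def has_flag(configure_output: str, flag: str) -> bool:
    return flag in configure_args(configure_output)

def cflags(configure_output: str) -> list[str]:
    # Return the individual compiler flags from the 'CFLAGS=...' argument, if any.
    for arg in configure_args(configure_output):
        if arg.startswith("CFLAGS="):
            return arg[len("CFLAGS="):].split()
    return []

# Abridged copy of the CONFIGURE line above, as it would appear unwrapped.
sample = ("'--build=x86_64-linux' '--prefix=/usr' '--enable-nls' "
          "'--enable-integer-datetimes' '--enable-debug' '--disable-rpath' "
          "'CC=cc' 'CFLAGS=-g -Wall -O2 -Wl,--as-needed'")

print(has_flag(sample, "--enable-debug"))  # True
print("-g" in cflags(sample))              # True
```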

How can I enable coredumps or something similarly useful for debugging
purposes?

Cheers,
Gavin.

#2 Martijn van Oosterhout
kleptog@svana.org
In reply to: Gavin Hamill (#1)
Re: Segfaults with 8.1.3 on amd64

On Thu, May 04, 2006 at 12:22:01PM +0100, Gavin Hamill wrote:

> Hi, our 8.1.3 system on quad Xeon has been happily chugging away for
> weeks with no stability problems until yesterday:
>
> /var/log/syslog:May 4 11:57:17 cayenne kernel: postmaster[19291]:
> segfault at 0000000000000000 rip 00002aaaab5e8c00 rsp 00007fffffffd418
> error 4

<snip>

> I don't know what the rip + rsp values represent, but is it interesting
> that they are identical in all three cases?

At a guess rip = return instruction pointer, rsp = return stack pointer.
The fact that they're all the same seems to rule out hardware.

> I'm using Debian sarge with the 8.1.3 debs from backports.org which I
> trust; I doubt running postmaster under gdb will be workable due to the
> performance penalty.

I didn't think attaching gdb had much effect on performance, but you
may be right.

<snip other usual output on server crash>

> How can I enable coredumps or something similarly useful for debugging
> purposes?

Before starting the server, run "ulimit -S -c unlimited".

If done properly it should enable core dumps for the backend.
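`ulimit -c` adjusts the per-process RLIMIT_CORE resource limit, and that limit is inherited across fork() and exec(). This is why it has to be set in the shell (or init script) that launches the postmaster, so that the backends forked from it inherit it. The Python sketch below does the equivalent for the current process and is shown for illustration only; `enable_core_dumps` is a hypothetical helper. One assumption worth stating: an unprivileged process may raise its soft limit only up to the hard limit, so `ulimit -S -c unlimited` succeeds only when the hard limit is already unlimited, and the sketch therefore raises the soft limit to whatever the hard limit is.

```python
import resource

def enable_core_dumps() -> tuple[int, int]:
    """Raise the soft RLIMIT_CORE to the hard limit (roughly `ulimit -S -c <hard>`)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    if soft != hard:
        # Allowed without privileges: the soft limit stays <= the hard limit.
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)

soft, hard = enable_core_dumps()
print("core limit:", "unlimited" if soft == resource.RLIM_INFINITY else soft)
```

Any child started afterwards (for example with `subprocess`) inherits the raised limit, just as backends inherit it from the postmaster.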

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#3 Guy Rouillier
guyr@masergy.com
In reply to: Martijn van Oosterhout (#2)
Re: Segfaults with 8.1.3 on amd64

Martijn van Oosterhout wrote:

> On Thu, May 04, 2006 at 12:22:01PM +0100, Gavin Hamill wrote:
>
>> Hi, our 8.1.3 system on quad Xeon has been happily chugging away for
>> weeks with no stability problems until yesterday:
>>
>> /var/log/syslog:May 4 11:57:17 cayenne kernel: postmaster[19291]:
>> segfault at 0000000000000000 rip 00002aaaab5e8c00 rsp
>> 00007fffffffd418 error 4
>
> <snip>
>
>> I don't know what the rip + rsp values represent, but is it
>> interesting that they are identical in all three cases?
>
> At a guess rip = return instruction pointer, rsp = return stack
> pointer. The fact that they're all the same seems to rule out hardware.

The R* registers in AMD64 are just the 64-bit extensions to the standard
registers. They couldn't use EIP because that was taken in the
expansion from 16-bit to 32-bit. So RIP is simply the 64-bit
instruction pointer, RSP the 64-bit stack pointer.
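With RIP and RSP identified, the rest of the kernel line can be decoded too. What follows is a minimal Python sketch, assuming the message format in the syslog lines from the first post (mail-wrapped lines rejoined); `parse` and `decode_error` are hypothetical helpers. The `error` field is the x86 page-fault error code, printed in hex. Bit 0 separates a protection violation from a non-present page, bit 1 a write from a read, and bit 2 user mode from kernel mode. So "segfault at 0 ... error 4" is a user-mode read from address zero, the usual signature of a NULL pointer dereference. An identical rip in every crash means the same instruction faulted each time, which is the backtrace later pinned on libc's strlen.

```python
import re

# Kernel segfault message, as in the syslog lines from the first post.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)

def parse(line: str) -> dict:
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError(f"not a segfault line: {line!r}")
    fault = m.groupdict()
    fault["addr"] = int(fault["addr"], 16)  # faulting data address
    fault["err"] = int(fault["err"], 16)    # x86 page-fault error code (printed in hex)
    return fault

def decode_error(code: int) -> str:
    # Low three bits of the x86 page-fault error code.
    cause = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user" if code & 4 else "kernel"
    return f"{mode}-mode {access}, {cause}"

lines = [
    "May 4 11:57:17 cayenne kernel: postmaster[19291]: segfault at 0000000000000000 "
    "rip 00002aaaab5e8c00 rsp 00007fffffffd418 error 4",
    "May 3 09:39:06 cayenne kernel: postmaster[32698]: segfault at 0000000000000000 "
    "rip 00002aaaab5e8c00 rsp 00007fffffffd418 error 4",
    "May 3 11:02:00 cayenne kernel: postmaster[12427]: segfault at 0000000000000000 "
    "rip 00002aaaab5e8c00 rsp 00007fffffffd418 error 4",
]

faults = [parse(line) for line in lines]
print({f["rip"] for f in faults})      # {'00002aaaab5e8c00'}
print(decode_error(faults[0]["err"]))  # user-mode read, page not present
```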

--
Guy Rouillier

#4 Gavin Hamill
gdh@laterooms.com
In reply to: Martijn van Oosterhout (#2)
Re: Segfaults with 8.1.3 on amd64

Martijn van Oosterhout wrote:

> On Thu, May 04, 2006 at 12:22:01PM +0100, Gavin Hamill wrote:
>
> At a guess rip = return instruction pointer, rsp = return stack pointer.
> The fact that they're all the same seems to rule out hardware.

That's good to hear (in one way... :)

> Before starting the server, run "ulimit -S -c unlimited"
>
> If done properly it should enable core dumps for the backend.
>
> Have a nice day,

Great stuff: it's crashed again and dropped 6MB of core, which points
the finger squarely at Slony. I'll ask on the relevant list :)

Core was generated by `postgres: sharp laterooms 194.24.250.135(54478) UPDATE '.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libpam.so.0...(no debugging symbols found)...done.
....
Reading symbols from /usr/lib/postgresql/8.1/lib/slony1_funcs.so...done.
Loaded symbols for /usr/lib/postgresql/8.1/lib/slony1_funcs.so
Reading symbols from /usr/lib/postgresql/8.1/lib/xxid.so...done.
Loaded symbols for /usr/lib/postgresql/8.1/lib/xxid.so
#0 0x00002aaaab5e8c00 in strlen () from /lib/libc.so.6
(gdb) bt
#0 0x00002aaaab5e8c00 in strlen () from /lib/libc.so.6
#1 0x00002aaaca65b062 in slon_quote_literal (str=0x0) at slony1_funcs.c:1044
#2 0x00002aaaca65c348 in _Slony_I_logTrigger (fcinfo=0x8f5ec5) at slony1_funcs.c:783
#3 0x00000000005ca9f9 in fmgr_internal_function ()
#4 0x00000000004ce6a4 in FreeTriggerDesc ()
#5 0x00000000004cf42e in ExecARUpdateTriggers ()
#6 0x00000000004cf873 in ExecARUpdateTriggers ()
#7 0x00000000004cfb10 in AfterTriggerEndQuery ()
#8 0x000000000055ef05 in FreeQueryDesc ()
#9 0x000000000055fecf in PortalRun ()
#10 0x000000000055f78f in PortalRun ()
#11 0x000000000055b721 in pg_plan_queries ()
#12 0x000000000055e14c in PostgresMain ()
#13 0x0000000000539cc1 in ClosePostmasterPorts ()
#14 0x0000000000539797 in ClosePostmasterPorts ()
#15 0x0000000000537d3d in PostmasterMain ()
#16 0x000000000053704e in PostmasterMain ()
#17 0x00000000004fdb58 in main ()
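The useful part of this backtrace is frame #1. `slon_quote_literal` was called with `str=0x0` and passed it to `strlen()`, which reads from address zero. That matches the "segfault at 0000000000000000" in the kernel log. For longer traces the same check can be automated. What follows is a minimal Python sketch, assuming gdb's `#N 0xADDR in func (args) at file:line` layout with wrapped lines rejoined. `null_arg_frames` is a hypothetical helper, and a 0x0 argument is only a lead, not proof of where the bug originates. Frames printed without arguments, such as the postgres frames here, simply produce no hits.

```python
import re

# One gdb backtrace frame: "#N [0xADDR in ]func (args) ..."
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+(?:0x[0-9a-f]+ in )?(?P<func>\w+) \((?P<args>[^)]*)\)"
)

def null_arg_frames(backtrace: str) -> list[tuple[int, str, str]]:
    """Return (frame number, function, argument name) for each argument shown as 0x0."""
    hits = []
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line.strip())
        if not m:
            continue
        for arg in filter(None, (a.strip() for a in m["args"].split(","))):
            name, _, value = arg.partition("=")
            if value == "0x0":
                hits.append((int(m["num"]), m["func"], name))
    return hits

bt = """\
#0 0x00002aaaab5e8c00 in strlen () from /lib/libc.so.6
#1 0x00002aaaca65b062 in slon_quote_literal (str=0x0) at slony1_funcs.c:1044
#2 0x00002aaaca65c348 in _Slony_I_logTrigger (fcinfo=0x8f5ec5) at slony1_funcs.c:783
#3 0x00000000005ca9f9 in fmgr_internal_function ()
"""

print(null_arg_frames(bt))  # [(1, 'slon_quote_literal', 'str')]
```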

Cheers,
Gavin.