troubleshooting hung postmasters

Started by Ed L. over 20 years ago | 2 messages | general
#1 Ed L.
pgsql@bluepolka.net

I need some troubleshooting help with two hung postmasters
this morning, both PostgreSQL 7.4.6 on ia64-hp-hpux11.23,
compiled by gcc 3.4.2, each on a separate rx4640. Existing
connections were processing queries just fine. Any new
connection attempts hung with no server log output. The
postmaster was unresponsive to SIGTERM/INT/QUIT. I had to
take them down harshly to restart. The systems were using
86% and 96% of RAM, respectively, and virtually no CPU or
I/O, processing ~60 and ~150 transactions/second at the
time of the hang.

I've collected backtraces from both postmasters; they are
pretty much identical:

(gdb) attach 2601
Attaching to program: /opt/postgresql-7.4.6/bin/postgres, process 2601
warning: The shared libraries were not privately mapped; setting a
breakpoint in a shared library will not work until you rerun the program.
Use the following command to enable debugging of shared libraries.
chatr +dbg enable a.out
Reading symbols from /usr/lib/hpux64/libxnet.so.1...done.
Reading symbols from /usr/lib/hpux64/libc.so.1...done.
Reading symbols from /usr/lib/hpux64/libgen.so.1...done.
Reading symbols from /usr/lib/hpux64/libdl.so.1...done.
Reading symbols from /usr/lib/hpux64/libnsl.so.1...done.
Reading symbols from /usr/lib/hpux64/libm.so.1...done.
Reading symbols from /usr/lib/hpux64/libxti.so.1...done.
Reading symbols from /usr/lib/hpux64/libnss_files.so.1...done.
0xc000000000304230:0 in _accept_sys+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) p debug_query_string
$1 = 0
(gdb) bt
#0 0xc000000000304230:0 in _accept_sys+0x30 () from /usr/lib/hpux64/libc.so.1
#1 0xc0000000003100b0:0 in accept+0x150 () from /usr/lib/hpux64/libc.so.1
#2 0xc000000001aac450:0 in accept+0x70 () from /usr/lib/hpux64/libxnet.so.1
#3 0x400000000029a710:0 in StreamConnection+0x50 ()
#4 0x40000000003147f0:0 in ServerLoop+0x4e0 ()
#5 0x4000000000318cd0:0 in PostmasterMain+0x1080 ()
#6 0x400000000029e470:0 in main+0x490 ()

Any suggestions? I can upgrade to 7.4.

Ed

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ed L. (#1)
Re: troubleshooting hung postmasters

"Ed L." <pgsql@bluepolka.net> writes:

> I need some troubleshooting help with two hung postmasters
> this morning, both PostgreSQL 7.4.6 on ia64-hp-hpux11.23,
> compiled by gcc 3.4.2, each on a separate rx4640.
>
> (gdb) bt
> #0 0xc000000000304230:0 in _accept_sys+0x30 () from /usr/lib/hpux64/libc.so.1
> #1 0xc0000000003100b0:0 in accept+0x150 () from /usr/lib/hpux64/libc.so.1
> #2 0xc000000001aac450:0 in accept+0x70 () from /usr/lib/hpux64/libxnet.so.1
> #3 0x400000000029a710:0 in StreamConnection+0x50 ()
> #4 0x40000000003147f0:0 in ServerLoop+0x4e0 ()
> #5 0x4000000000318cd0:0 in PostmasterMain+0x1080 ()
> #6 0x400000000029e470:0 in main+0x490 ()

Hmm, this appears to indicate that the postmaster is stuck on an
accept() on a socket that was reported read-ready by select()
(else control would not reach StreamConnection), yet the socket
doesn't actually have an available connection (else it wouldn't block).

I'm not sure how this condition can occur --- you might want to ask
some HPUX gurus about it. It looks like a kernel bug to me; I can't
see anything in the man pages suggesting that select's read-ready
indication might be a lie.
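
For illustration only, here is a minimal Python sketch of the select()-then-accept() pattern described above. It is not the postmaster's code, and the names accept_if_ready and demo are made up for this example. It assumes the listening socket is put in non-blocking mode, so that if select() reports the socket read-ready but no connection is actually available, accept() fails immediately instead of blocking the way the backtrace shows. Whether that would have helped with the HP-UX behavior seen here is not established.

```python
import select
import socket


def accept_if_ready(listener, timeout=0.5):
    """Wait up to `timeout` seconds for a pending connection, then accept it.

    Returns the connected socket, or None if nothing could be accepted.
    Assumes `listener` is non-blocking, so accept() cannot hang.
    """
    readable, _, _ = select.select([listener], [], [], timeout)
    if not readable:
        return None  # select() timed out: nothing pending
    try:
        conn, _addr = listener.accept()
    except BlockingIOError:
        # select() said "ready" but no connection is available after all.
        # Return to the caller's loop instead of blocking.
        return None
    return conn


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(5)
    listener.setblocking(False)

    # No client has connected yet, so select() should time out.
    nothing_pending = accept_if_ready(listener, timeout=0.1) is None

    # Calling accept() with nothing pending raises instead of hanging.
    try:
        listener.accept()
        raised = False
    except BlockingIOError:
        raised = True

    # Once a client connects, select() reports ready and accept() succeeds.
    client = socket.create_connection(listener.getsockname())
    conn = accept_if_ready(listener, timeout=2.0)
    accepted = conn is not None

    if conn is not None:
        conn.close()
    client.close()
    listener.close()
    return nothing_pending, raised, accepted


if __name__ == "__main__":
    print(demo())
```

With a blocking listener, the direct accept() call in demo() would wait indefinitely, which matches the state the backtrace shows the postmaster in.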

regards, tom lane