Postgres Crashes

Started by Prem Gopalan almost 22 years ago, 5 messages, hackers
#1 Prem Gopalan
prem@mazunetworks.com

We run a multithreaded application that uses Postgres 7.4 on Linux
2.4.18 on a dual-CPU Xeon machine. We see occasional weird crashes and
have tried many things to reproduce them in-house, without success. We
do have core dumps; I have listed the backtraces and their common
characteristics below.

Briefly, the last frame is a call to a glibc function (or, rarely, a
function in some other shared library), and the instruction pointer
points to the indirect jmp instruction, in the PLT stub, that transfers
control to that shared-library function. Almost all of the core dumps
show this characteristic.

The dying process is the postmaster. After these crashes the postmaster
is missing from ps and no new connections are possible. The existing
backend processes stay up until their connections close.

Any ideas appreciated.

core #1 (8706)
-------

(gdb) bt
#0 0x0806f1c4 in snprintf ()
#1 0x081a7f50 in send_message_to_frontend (edata=0x826c1e0)
    at /root/src/postgres/src/backend/utils/error/elog.c:1239
#2 0x081a6c85 in errfinish (dummy=0)
    at /root/src/postgres/src/backend/utils/error/elog.c:359
#3 0x081a767e in elog_finish (elevel=20, fmt=0x8235680 "invalid memory alloc request size %lu")
    at /root/src/postgres/src/backend/utils/error/elog.c:853

(gdb) disassemble
Dump of assembler code for function snprintf:
0x0806f1c4 <snprintf+0>: jmp *0x823f02c
0x0806f1ca <snprintf+6>: push $0x508
0x0806f1cf <snprintf+11>: jmp 0x806e7a4 <_init+24>
End of assembler dump.

(gdb) x/i $pc
0x806f1c4 <snprintf>: jmp *0x823f02c

(gdb) x *0x823f02c
0x182de130 <snprintf>: push %ebp

(gdb) disassemble *0x823f02c
Dump of assembler code for function snprintf:
0x182de130 <snprintf+0>: push %ebp
0x182de131 <snprintf+1>: mov %esp,%ebp
0x182de133 <snprintf+3>: push %ebx

core #2 (5889)
-------

(gdb) bt
#0 0x0806f0b4 in memcpy ()
#1 0x08103cee in pq_getbytes (s=0xbfffeb5c ".", len=4)
    at /root/src/postgres/src/backend/libpq/pqcomm.c:748
#2 0x08103e04 in pq_getmessage (s=0xbfffec10, maxlen=0)
    at /root/src/postgres/src/backend/libpq/pqcomm.c:837
#3 0x0814c98b in SocketBackend (inBuf=0xbfffec10)
    at /root/src/postgres/src/backend/tcop/postgres.c:377

(gdb) disassemble
Dump of assembler code for function memcpy:
0x0806f0b4 <memcpy+0>: jmp *0x823efe8
0x0806f0ba <memcpy+6>: push $0x480
0x0806f0bf <memcpy+11>: jmp 0x806e7a4 <_init+24>
End of assembler dump.

(gdb) x/i $pc
0x806f0b4 <memcpy>: jmp *0x823efe8

(gdb) x *0x823efe8
0x18304f18 <memcpy>: push %ebp

(gdb) disassemble *0x823efe8
Dump of assembler code for function memcpy:
0x18304f18 <memcpy+0>: push %ebp
0x18304f19 <memcpy+1>: mov %esp,%ebp
0x18304f1b <memcpy+3>: mov 0x10(%ebp),%eax

core #3 (32662)
-------

(gdb) bt
#0 0x0806f3c4 in strncpy ()
#1 0x081b22fa in set_ps_display (activity=0x4 <Address 0x4 out of bounds>)
    at /root/src/postgres/src/backend/utils/misc/ps_status.c:282
#2 0x0814f3f5 in PostgresMain (argc=4, argv=0x8279838, username=0x8279808 "postgres")
    at /root/src/postgres/src/backend/tcop/postgres.c:2805
#3 0x0812f24b in BackendFork (port=0x82877a8)
    at /root/src/postgres/src/backend/postmaster/postmaster.c:2558

(gdb) x/i $pc
0x806f3c4 <strncpy>: jmp *0x823f0ac

(gdb) disassemble *0x823f0ac
Dump of assembler code for function strncpy:
0x183033c0 <strncpy+0>: push %ebp
0x183033c1 <strncpy+1>: mov %esp,%ebp
0x183033c3 <strncpy+3>: push %edi

core #4 (28335)
-------

(gdb) bt
#0 0x0806f0c1 in memcpy ()
#1 0x08103cee in pq_getbytes (s=0xbfffeb5c "\f", len=4)
    at /root/src/postgres/src/backend/libpq/pqcomm.c:748
#2 0x08103e04 in pq_getmessage (s=0xbfffec10, maxlen=0)
    at /root/src/postgres/src/backend/libpq/pqcomm.c:837
#3 0x0814c98b in SocketBackend (inBuf=0xbfffec10)
    at /root/src/postgres/src/backend/tcop/postgres.c:377

(gdb) x/i $pc
0x806f0c1 <memcpy+13>: idiv %bh

(gdb) disassemble
Dump of assembler code for function memcpy:
0x0806f0b4 <memcpy+0>: jmp *0x823efe8
0x0806f0ba <memcpy+6>: push $0x480
0x0806f0bf <memcpy+11>: jmp 0x806e7a4 <_init+24>
End of assembler dump.

(gdb) disassemble *0x823efe8
Dump of assembler code for function memcpy:
0x18304f18 <memcpy+0>: push %ebp
0x18304f19 <memcpy+1>: mov %esp,%ebp
0x18304f1b <memcpy+3>: mov 0x10(%ebp),%eax

core #5 (22375)
--------

(gdb) bt
#0 0x0806f32c in SSL_CTX_use_certificate_file ()
#1 0x08103cee in pq_getbytes (s=0xbfffeb5c "\f", len=4)
    at /root/src/postgres/src/backend/libpq/pqcomm.c:748
#2 0x08103e04 in pq_getmessage (s=0xbfffec10, maxlen=0)
    at /root/src/postgres/src/backend/libpq/pqcomm.c:837
#3 0x0814c98b in SocketBackend (inBuf=0xbfffec10)
    at /root/src/postgres/src/backend/tcop/postgres.c:377

(gdb) x/i $pc
0x806f32c <SSL_CTX_use_certificate_file+8>: add $0x70e90000,%eax

(gdb) disassemble
Dump of assembler code for function SSL_CTX_use_certificate_file:
0x0806f324 <SSL_CTX_use_certificate_file+0>: jmp *0x823f084
0x0806f32a <SSL_CTX_use_certificate_file+6>: push $0x5b8
0x0806f32f <SSL_CTX_use_certificate_file+11>: jmp 0x806e7a4 <_init+24>
End of assembler dump.

(gdb) disassemble *0x823f084
Dump of assembler code for function SSL_CTX_use_certificate_file:
0x0806f324 <SSL_CTX_use_certificate_file+0>: jmp *0x823f084
0x0806f32a <SSL_CTX_use_certificate_file+6>: push $0x5b8
0x0806f32f <SSL_CTX_use_certificate_file+11>: jmp 0x806e7a4 <_init+24>
End of assembler dump.

core #6 (22371)
-------

(gdb) bt
#0 0x0806f2e0 in semop ()
#1 0x1f0812c0 in ?? ()
#2 0x08149778 in LWLockRelease (lockid=WALWriteLock)
    at /root/src/postgres/src/backend/storage/lmgr/lwlock.c:493
#3 0x0808f969 in XLogFlush (record={xlogid = 1, xrecoff = 394033512})
    at /root/src/postgres/src/backend/access/transam/xlog.c:1287

(gdb) x/i $pc
0x806f2e0 <semop+12>: (bad)

(gdb) disassemble
Dump of assembler code for function semop:
0x0806f2d4 <semop+0>: jmp *0x823f070
0x0806f2da <semop+6>: push $0x590
0x0806f2df <semop+11>: jmp 0x806e7a4 <_init+24>
End of assembler dump.
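
Every PLT stub in these dumps has the same 16-byte layout: jmp *GOT at +0 (6 bytes), push $reloc at +6 (5 bytes) and jmp to _init+24 at +11 (5 bytes). The small C helper below is only a sketch for checking whether a reported pc falls on one of those instruction boundaries; the function name is hypothetical and the layout is taken from the disassembly above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Layout of one 32-bit x86 PLT entry, as seen in the dumps:
 *   +0   jmp  *GOT[n]     6 bytes
 *   +6   push $reloc      5 bytes
 *   +11  jmp  PLT0        5 bytes   (shown by gdb as _init+24) */
#define PLT_ENTRY_SIZE 16u

/* Hypothetical helper: true if pc lies on the first byte of one of the
 * three instructions of the PLT entry that starts at entry. */
static bool plt_pc_on_boundary(uint32_t entry, uint32_t pc)
{
    assert(pc >= entry && pc - entry < PLT_ENTRY_SIZE);
    uint32_t off = pc - entry;
    return off == 0 || off == 6 || off == 11;
}
```

With the addresses from the dumps, cores #1 to #3 stop at +0 (the indirect jmp itself), while cores #4 (+13), #5 (+8) and #6 (+12) stop in the middle of an instruction, which normal execution of these stubs should not produce.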

#2 Bruce Momjian
bruce@momjian.us
In reply to: Prem Gopalan (#1)
Re: Postgres Crashes

Prem Gopalan wrote:

We run a multithreaded application that uses postgres 7.4 on Linux
2.4.18, dual cpu Xeon processor machine. We have occasional weird
crashes and have tried a lot of things to reproduce them in house, but
in vain. We do have coredumps and I have listed the backtraces and
their common characteristics here.

Whether your client is multi-threaded or not should have no effect on
the postmaster or on any crashes you see there.

This part of the backtrace seems significant:

#3 0x081a767e in elog_finish (elevel=20, fmt=0x8235680 "invalid memory alloc request size %lu")

I wonder if you are allocating too much memory. Looking at the 7.4 code
I see these all as ERROR, not FATAL (backend exits) or PANIC (postmaster
exits), so it shouldn't be crashing anything:

./backend/utils/mmgr/mcxt.c:        elog(ERROR, "invalid memory alloc request size %lu",
./backend/utils/mmgr/mcxt.c:        elog(ERROR, "invalid memory alloc request size %lu",
./backend/utils/mmgr/mcxt.c:        elog(ERROR, "invalid memory alloc request size %lu",
./backend/utils/mmgr/mcxt.c:        elog(ERROR, "invalid memory alloc request size %lu",

Would you send over a backtrace that shows more levels above this?

Can you reproduce this crash on demand?

I can't imagine why you would get this error. I wonder if you have a
problem with bad memory on that machine?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#3 F Harvell
fharvell@fts.net
In reply to: Prem Gopalan (#1)
Re: Postgres Crashes

This sounds very much like a memory problem. I would replace all of
the memory with another set of (preferably known-good) memory and see
whether the problems persist. Also look for other core files that may
have been dropped. If there are several, memory is the likely cause.
Be aware that it is usually active, memory-intensive applications (and
PostgreSQL may be the only one on the server) that expose these
problems.

Memory-testing applications may also reveal the problem; however, they
do not exercise memory the way production use does. I have had test
programs run for weeks without error on machines where production use
caused failures within minutes. Also note that I have seen bad CPUs
(possibly a bad cache) cause similar problems.
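
For a rough user-space check, a program can write a pattern into a large buffer and read it back. This is only a sketch (the function name is hypothetical): it tests whatever pages the kernel happens to hand the process, cannot target specific physical memory, and, as noted above, will not load the machine the way production does. Dedicated tools such as memtest86, which run outside the operating system, are more thorough.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: write pattern ^ index into `words` 32-bit words,
 * read everything back, and return the number of mismatching words
 * (SIZE_MAX if the size is zero, too large, or cannot be allocated).
 * volatile keeps the compiler from optimizing the read-back away. */
static size_t pattern_test(size_t words, uint32_t pattern)
{
    if (words == 0 || words > SIZE_MAX / sizeof(uint32_t))
        return SIZE_MAX;
    volatile uint32_t *buf = malloc(words * sizeof(uint32_t));
    if (buf == NULL)
        return SIZE_MAX;
    for (size_t i = 0; i < words; i++)
        buf[i] = pattern ^ (uint32_t) i;   /* vary the value per word */
    size_t bad = 0;
    for (size_t i = 0; i < words; i++)
        if (buf[i] != (pattern ^ (uint32_t) i))
            bad++;
    free((void *) buf);
    return bad;
}
```

A clean run proves little; a non-zero count on a machine that otherwise seems healthy is a strong hint toward hardware.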

On 30 Apr 2004, at 15:24, Prem Gopalan wrote:


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Prem Gopalan (#1)
Re: Postgres Crashes

prem@mazunetworks.com (Prem Gopalan) writes:

The dying process is postmaster. After these crashes the server is
missing from ps and no more new connections are possible. The backend
processes stay on till their connections close.

That behavior does sound like a postmaster crash --- but all the stack
traces you show are clearly in backend code. A backend crash ought not
take out the postmaster. So something fairly odd is going on here.
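
The isolation described here follows from the process model visible in core #3, where the backend was started through BackendFork(): each backend is a separate child process, so when it crashes the parent only learns, through waitpid(), that the child was killed by a signal. The POSIX C sketch below illustrates that property; the function name is hypothetical and this is not PostgreSQL code.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that dies from SIGSEGV and return the
 * signal number that killed it, or -1 on error. The parent keeps running. */
static int fork_crashing_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: suppress the core file, then crash on purpose. */
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core);
        signal(SIGSEGV, SIG_DFL);
        raise(SIGSEGV);
        _exit(0);                      /* not reached */
    }
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The parent returns normally with SIGSEGV as the result, which is why a postmaster that disappears alongside backend crashes points to something outside ordinary backend failure.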

What, if anything, shows up in the postmaster's stderr log when this
happens?

regards, tom lane

#5 Prem Gopalan
prem@mazunetworks.com
In reply to: Tom Lane (#4)
Re: Postgres Crashes

After reproducing these crashes and running tests long enough, we found
that they happen on Linux 2.4.18 but not on Linux 2.4.25. In all
likelihood, this is a kernel (or driver) issue.
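
Since the crashes were seen only on 2.4.18, one option is for a deployment script or the application to warn about kernels older than 2.4.25. A minimal sketch, assuming a release string that starts with major.minor.patch (as uname(2) reports for these kernels); the helper names are hypothetical.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: parse "major.minor.patch" from the start of a kernel
 * release string such as "2.4.18" or "2.4.25-custom". Returns 1 if the
 * release is at least maj.min.pat, 0 if older, -1 if it cannot be parsed. */
static int release_at_least(const char *release, int maj, int min, int pat)
{
    int a, b, c;
    if (sscanf(release, "%d.%d.%d", &a, &b, &c) != 3)
        return -1;
    if (a != maj)
        return a > maj;
    if (b != min)
        return b > min;
    return c >= pat;
}

/* Hypothetical helper: apply the same check to the running kernel. */
static int running_kernel_at_least(int maj, int min, int pat)
{
    struct utsname u;
    if (uname(&u) != 0)
        return -1;
    return release_at_least(u.release, maj, min, pat);
}
```

For example, release_at_least("2.4.18", 2, 4, 25) returns 0 and release_at_least("2.4.25", 2, 4, 25) returns 1.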

Could you kindly make sure this gets posted to my mail thread with the
subject "Postgres Crashes"?

Thanks
Prem.

On Wed, 5 May 2004, Tom Lane wrote:
