old 7.2 bug?

Started by Patrick Welche | almost 24 years ago | 6 messages | general
#1Patrick Welche
prlw1@newn.cam.ac.uk

I have had an odd core dump in a bit of code that runs every hour since March
2001, and for some reason just dumped core...

PostgreSQL 7.2 on i386-unknown-netbsdelf1.5ZC, compiled by GCC 2.95.3
fe-exec.c:
$Header: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-exec.c,v 1.113 2001/10/25 05:50:13 momjian Exp $

(gdb) bt
#0 0x480f1d63 in strtol () from /usr/lib/libc.so.12
#1 0x480f1e53 in strtol () from /usr/lib/libc.so.12
#2 0x480f257d in strtol () from /usr/lib/libc.so.12
#3 0x480656fd in pqResultAlloc (res=0x81cfa80, nBytes=32, isBinary=1 '\001')
    at fe-exec.c:392
#4 0x48065ffd in getRowDescriptions (conn=0x8050000) at fe-exec.c:1025
#5 0x48065e16 in parseInput (conn=0x8050000) at fe-exec.c:919
#6 0x4806648d in PQgetResult (conn=0x8050000) at fe-exec.c:1249
#7 0x48066629 in PQexec (conn=0x8050000,
    query=0x804e080 "SELECT MAX(fromoctets),MAX(tooctets) FROM stats,trans WHERE stats.id=stats_id AND firsttimei=494282304 AND sourcepeeraddress='xxx.yyy.zzz.aaa' AND flowindex=4455 AND timeslice<'20:40:00 Thu 4 Jul 200"...)
    at fe-exec.c:1362
#8 0x8049a01 in dladdr ()
#9 0x8049145 in dladdr ()
#10 0x8048d5c in free ()
(gdb)

This is from within a C program (odd that main() and yyparse() don't appear
above...) and running the query by hand just gives:

 max  | max
------+------
 8100 | 8100
(1 row)

The code at pqResultAlloc() is:

    /* Otherwise, start a new block. */
    block = (PGresult_data *) malloc(PGRESULT_DATA_BLOCKSIZE);
    if (!block)
        return NULL;
    block->next = res->curBlock;

(gdb) frame 3
#3 0x480656fd in pqResultAlloc (res=0x81cfa80, nBytes=32, isBinary=1 '\001')
    at fe-exec.c:392
392 block = (PGresult_data *) malloc(PGRESULT_DATA_BLOCKSIZE);
(gdb) print block
$5 = (PGresult_data *) 0x48122000
(gdb) print *block
$6 = {next = 0x0, space = ""}
(gdb) print *res
$7 = {ntups = 0, numAttributes = 2, attDescs = 0x0, tuples = 0x0,
  tupArrSize = 0, resultStatus = PGRES_TUPLES_OK,
  cmdStatus = '\000' <repeats 39 times>, binary = 0, xconn = 0x8050000,
  noticeHook = 0x4806540c <defaultNoticeProcessor>, noticeArg = 0x0,
  client_encoding = 0, errMsg = 0x0, null_field = "", curBlock = 0x0,

Upgrade/ignore/comments?

Cheers,

Patrick

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Patrick Welche (#1)
Re: old 7.2 bug?

Patrick Welche <prlw1@newn.cam.ac.uk> writes:

> I have had an odd core dump in a bit of code that runs every hour since March
> 2001, and for some reason just dumped core...

Is the crash reproducible if you run it again?

regards, tom lane

#3Patrick Welche
prlw1@newn.cam.ac.uk
In reply to: Tom Lane (#2)
Re: old 7.2 bug?

On Sat, Jul 06, 2002 at 03:53:15PM -0400, Tom Lane wrote:

> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
>
> > I have had an odd core dump in a bit of code that runs every hour since March
> > 2001, and for some reason just dumped core...
>
> Is the crash reproducible if you run it again?

No, and there wasn't anything special about the line it died on. The
comment

    /* Otherwise, start a new block. */
    block = (PGresult_data *) malloc(PGRESULT_DATA_BLOCKSIZE);

seems to suggest it's a memory thing? It just happened to be one row too
many? But then, the file wasn't *that* big compared to others that went
through OK.

Cheers,

Patrick

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Patrick Welche (#3)
Re: old 7.2 bug?

Patrick Welche <prlw1@newn.cam.ac.uk> writes:

> seems to suggest it's a memory thing? It just happened to be one row too
> many?

But the query was only one row, no? Seems unlikely that it ran out of
memory (especially if the core dump file is not large).

I suspect one of two possibilities:

1. Hardware glitch (cosmic ray flipped a bit in memory, or something).

2. Software bug caused a memory stomp on malloc's private data
structures, leading to coredump in a later malloc call.

Malloc is frequently the first thing to fall over when you have the
kind of bug that writes a few more bytes than it's supposed to ---
because what it'll likely clobber is the header of the next malloc
block.

If you think #2 seems plausible then I suggest relinking your client
with a debugging malloc package to see what you can learn. I've used
Electric Fence for this sort of problem once or twice, but perhaps
others have better suggestions (efence is pretty old).
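
To make possibility 2 concrete, here is a minimal sketch in C of the guard-byte idea that some debugging allocators build on. It is not Electric Fence's implementation (efence places an inaccessible page next to each allocation, so the faulting write itself crashes) and it is not libpq code. The names guarded_malloc, guarded_check, guarded_free and demo_off_by_one are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers: not Electric Fence's API and not part of libpq. */
#define GUARD_LEN  8
#define GUARD_BYTE 0xA5

/*
 * Block layout: [size_t requested size][user bytes][GUARD_LEN guard bytes].
 * The returned pointer is only size_t-aligned, which is enough for the
 * byte buffers used in this sketch.
 */
void *guarded_malloc(size_t n)
{
    unsigned char *raw;

    if (n > SIZE_MAX - sizeof(size_t) - GUARD_LEN)
        return NULL;                      /* request would overflow */
    raw = malloc(sizeof(size_t) + n + GUARD_LEN);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &n, sizeof n);            /* remember the requested size */
    memset(raw + sizeof(size_t) + n, GUARD_BYTE, GUARD_LEN);
    return raw + sizeof(size_t);
}

/* Returns 1 if the guard bytes are intact, 0 if something wrote past the end. */
int guarded_check(const void *p)
{
    const unsigned char *user = p;
    size_t n, i;

    assert(p != NULL);
    memcpy(&n, user - sizeof(size_t), sizeof n);
    for (i = 0; i < GUARD_LEN; i++)
        if (user[n + i] != GUARD_BYTE)
            return 0;
    return 1;
}

void guarded_free(void *p)
{
    if (p != NULL)
        free((unsigned char *) p - sizeof(size_t));
}

/*
 * Simulates the classic bug: a buffer sized for the characters of a string
 * but not for its terminating NUL. The extra byte lands in the guard area,
 * which still belongs to our own allocation, so the demo itself is safe.
 * Returns 0 when the overrun is detected, 1 if not, -1 on allocation failure.
 */
int demo_off_by_one(void)
{
    const char *text = "8100";
    size_t len = strlen(text);
    char *buf = guarded_malloc(len);      /* bug: no room for '\0' */
    int intact;

    if (buf == NULL)
        return -1;
    memcpy(buf, text, len + 1);           /* writes one byte too many */
    intact = guarded_check(buf);
    guarded_free(buf);
    return intact;
}
```

With the system malloc, that one stray byte would typically land in the allocator's bookkeeping for the neighbouring block, and nothing would go wrong until a later, unrelated malloc or free tripped over it. That matches the kind of backtrace shown at the top of this thread.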

regards, tom lane

#5Patrick Welche
prlw1@newn.cam.ac.uk
In reply to: Tom Lane (#4)
Re: old 7.2 bug?

On Sat, Jul 06, 2002 at 05:27:00PM -0400, Tom Lane wrote:

> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
>
> > seems to suggest it's a memory thing? It just happened to be one row too
> > many?
>
> But the query was only one row, no? Seems unlikely that it ran out of
> memory (especially if the core dump file is not large).
>
> ...

Thank you for the suggestions. Since it isn't reproducible, the problem
will be hard to find (I still wonder why yyparse() didn't appear in the
core dump...). Or, of course, I could play Russian roulette: one core dump
in over a year isn't exactly bad :)

Cheers,

Patrick

#6Mike Castle
dalgoda@ix.netcom.com
In reply to: Patrick Welche (#3)
Re: old 7.2 bug?

In article <4744.1025990820@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> with a debugging malloc package to see what you can learn. I've used
> Electric Fence for this sort of problem once or twice, but perhaps
> others have better suggestions (efence is pretty old).

I wouldn't consider efence old. Just stable. :->

Actually, with efence, you probably don't have to relink; instead, run
the efence script. It will preload the efence shared lib, so you'll
grab its malloc instead of the system one. Only works for dynamically
linked binaries, of course.
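
As a rough illustration of the preload mechanism, here is a small C sketch a client could call at startup to record whether it was launched under a preload. It assumes the dynamic loader honours the LD_PRELOAD environment variable (as ELF loaders on Linux and the BSDs generally do) and that the preloaded library's name contains "libefence". Both are assumptions about the local setup, and the function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Returns 1 if the preload list (the value of LD_PRELOAD, or NULL when the
 * variable is unset) mentions a library whose name contains `needle`.
 * Hypothetical helper, not part of Electric Fence or libpq.
 */
int preload_mentions(const char *preload_value, const char *needle)
{
    if (preload_value == NULL || needle == NULL || *needle == '\0')
        return 0;
    return strstr(preload_value, needle) != NULL;
}

/*
 * Assumption: the efence wrapper script exports LD_PRELOAD with a library
 * named libefence.* before exec'ing the program.
 */
int running_under_efence(void)
{
    return preload_mentions(getenv("LD_PRELOAD"), "libefence");
}
```

A statically linked binary never consults the dynamic loader for malloc, so this check could report a preload that has no effect there; it only confirms what was requested in the environment.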

mrc
--
Mike Castle dalgoda@ix.netcom.com www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc