Problems with recent CVS versions and Solaris.

Started by Keith Parks over 25 years ago, 10 messages
#1 Keith Parks
emkxp01@mtcc.demon.co.uk

Hi all,

I regularly do a "cvs update" and compile and test PostgreSQL.

For about a week now, I've had a nasty problem.

Doing an "initdb" seems to suck up all available memory and almost
kills the system, before dying itself with a SEGV.

The problem postgres process is:-

/usr/local/pgsql/bin/postgres -boot -x -C -F -D/usr/local/pgsql/data -d template1

The system becomes VERY unresponsive when this postgres process
starts running, so it is difficult to attach to it with gdb.

I'm stuck for a clue as to how to debug this.

Is anyone else seeing this problem recently?

Is it just a Solaris problem?
(Solaris 2.6 on SPARCstation 5)

Is it just me? :-(

Help,

Keith.

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Keith Parks (#1)
Re: Problems with recent CVS versions and Solaris.

Keith Parks <emkxp01@mtcc.demon.co.uk> writes:

> For about a week now, I've had a nasty problem.
> Doing an "initdb" seems to suck up all available memory and almost
> kills the system, before dying itself with a SEGV.

Hmm --- no such problem noted here, and I've been doing lots of initdbs...

It must be somewhat platform-specific. See if you can get a coredump
and backtrace.

regards, tom lane

#3 Keith Parks
emkxp01@mtcc.demon.co.uk
In reply to: Tom Lane (#2)
2 attachment(s)
Re: Problems with recent CVS versions and Solaris.

Oops, mailed it to myself instead of the list!

It's been a long day...

------------- Begin Forwarded Message -------------

Date: Thu, 1 Jun 2000 23:31:01 +0100 (BST)
From: Keith Parks <emkxp01@mtcc.demon.co.uk>
Subject: Re: [HACKERS] Problems with recent CVS versions and Solaris.
To: emkxp01@mtcc.demon.co.uk
MIME-Version: 1.0

I've managed to get a backtrace, attached, thanks to Ross J. Reedstrom's
excellent example from the archives, also attached.

I'm not sure whether the stack shown is corrupt; it seems to just
loop over and over again. (I got fed up after 400+ frames.)

The final few frames show us asking for more memory, the point at
which things seem to go out of control.

#0 0xef5d33b8 in _brk_unlocked ()
#1 0xef5ce2f8 in _sbrk_unlocked ()
#2 0xef5ce26c in sbrk ()
#3 0xef585bb0 in _morecore ()
#4 0xef58549c in _malloc_unlocked ()
#5 0xef5852b4 in malloc ()
#6 0x139198 in AllocSetAlloc (set=0x1bea10, size=4032) at aset.c:285
#7 0x139ea8 in GlobalMemoryAlloc (this=0x1bea08, size=4008) at mcxt.c:419
#8 0x1399ec in MemoryContextAlloc (context=0x1bea08, size=4008) at mcxt.c:224
#9 0x12c700 in InitSysCache (relname=0x180f40 "pg_proc",
    iname=0x180f08 "pg_proc_oid_index", id=18, nkeys=1, key=0x19a2f0,
    iScanfuncP=0x6e1c8 <ProcedureOidIndexScan>) at catcache.c:705
#10 0x1312d8 in SearchSysCacheTuple (cacheId=18, key1=184, key2=0, key3=0,
    key4=0) at syscache.c:509

Is this any help?

I'm no expert in gdb, but I can follow instructions. ;-)

Thanks,
Keith.

------------- End Forwarded Message -------------

Attachments:

initdb_debug.txt.Z (application/x-sun-compress)
initdb_debug_session.txt.Z (application/x-sun-compress)

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Keith Parks (#3)
Re: Problems with recent CVS versions and Solaris.

Keith Parks <emkxp01@mtcc.demon.co.uk> writes:

> I've managed to get a backtrace, attached, thanks to Ross J. Reedstrom's
> excellent example from the archives, also attached.
>
> I'm not sure whether the stack shown is corrupt; it seems to just
> loop over and over again. (I got fed up after 400+ frames.)

What we've got here is the syscache trying to set up for a search of
cache 18, which I believe is the pg_proc-indexed-on-OID cache.
For that it needs the OID comparison function, "oideq" (OID 184).
It's asking the funcmgr for oideq ... and funcmgr is turning around
and asking the syscache for the pg_proc entry with OID 184. Ooops.

I thought there was an interlock in there to report a useful message if
a syscache got called recursively like this. Have to look at why it's
not working. However, I guess your real problem is that the funcmgr is
failing to find proc OID 184 in its own table of built-in functions.
The reason this isn't a recursion under normal circumstances is that the
comparison functions the syscaches need are all supposed to be hardwired
into fmgr.

My bet is that there is something snafu'd in your generation of
fmgrtab.c from pg_proc.h via Gen_fmgrtab.sh, such that your table of
builtin functions is either empty or corrupt.

Before wasting any additional time on it I'd recommend a make distclean,
cvs update, configure and rebuild from scratch to see if the problem
persists. I changed the Gen_fmgrtab.sh setup last week as part of the
first round of fmgr checkins, and I wouldn't be surprised to find that
you've just gotten burnt by out-of-sync files or some such (eg, a local
file that needs to be rebuilt but is timestamped a bit newer than the
cvs-supplied files it depends on).

If you still see the problem with a virgin build, take a look at
src/backend/utils/Gen_fmgrtab.sh and its output
src/backend/utils/fmgrtab.c to see if you can figure out what's
wrong. Could be that I introduced some kind of portability problem
into Gen_fmgrtab.sh ...
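
One rough way to do that check from the shell is sketched below. It
assumes the real entries in fmgrtab.c keep the shape produced by the
script's printf format (the one visible in the "set -x" trace later in
this thread), i.e. a line starting with { <oid>, "<name>". The helper
names count_builtin_rows and check_fmgrtab are made up for illustration
and are not part of the source tree.

```shell
# Sketch only: sanity-check a generated fmgrtab.c.
# Assumption: each real entry looks like   { 184, "oideq", ... },
# while the trailing dummy entry is        { 0, NULL, ... }

count_builtin_rows() {
    # grep -c prints the number of matching lines; it exits 1 when that
    # number is 0, so the status is masked to keep callers simple.
    grep -c '^ *{ [0-9][0-9]*, "' "$1" || true
}

check_fmgrtab() {
    n=$(count_builtin_rows "$1")
    # An unreadable file yields an empty count, treated here as 0.
    if [ "${n:-0}" -eq 0 ]; then
        echo "$1: no builtin function rows found" >&2
        return 1
    fi
    echo "$1: $n builtin function rows"
}
```

A table that contains only the dummy entry would make check_fmgrtab
fail, which is the "empty table" case suspected above.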

regards, tom lane

#5 Keith Parks
emkxp01@mtcc.demon.co.uk
In reply to: Tom Lane (#4)
Re: Problems with recent CVS versions and Solaris.

Tom,

You ain't arf clever.

Running Gen_fmgrtab.sh with a "set -x" shows:-

const FmgrBuiltin fmgr_builtins[] = {
+ awk { printf ("  { %d, \"%s\", %d, %s, %s, %s },\n"), \
        $1, $(NF-1), $9, \
        ($8 == "t") ? "true" : "false", \
        ($4 == "11") ? "true" : "false", \
        $(NF-1) } fmgr.raw 
awk: syntax error near line 3
awk: illegal statement near line 3
+ cat 
  /* dummy entry is easier than getting rid of comma after last real one */
  { 0, NULL, 0, false, false, (PGFunction) NULL }
};

/* Note fmgr_nbuiltins excludes the dummy entry */
const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)) - 1;

Looks like the problem is that Solaris's awk is "old" awk.

If I change the awk to nawk I get valid output.

I'm just about to start the clean build process with this change.

Once it's started I'm off to bed. Will check in the morning.

Thanks for your trouble, we just need a "portable" fix now.

Thanks,
Keith.

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Keith Parks (#5)
Re: Problems with recent CVS versions and Solaris.

Keith Parks <emkxp01@mtcc.demon.co.uk> writes:

> Running Gen_fmgrtab.sh with a "set -x" shows:-
>
> const FmgrBuiltin fmgr_builtins[] = {
> + awk { printf ("  { %d, \"%s\", %d, %s, %s, %s },\n"), \
>         $1, $(NF-1), $9, \
>         ($8 == "t") ? "true" : "false", \
>         ($4 == "11") ? "true" : "false", \
>         $(NF-1) } fmgr.raw
> awk: syntax error near line 3
> awk: illegal statement near line 3

Ugh. I think that the former version of the script didn't use
conditional expressions (a ? b : c). Perhaps old versions of
awk don't have those? If so we can probably work around it...
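
For illustration, a workaround along those lines might look like the
sketch below: the two ?: expressions are replaced by plain if/else
statements that set ordinary variables before the printf. This is not
the committed fix, just one way to avoid the construct. The function
name gen_builtin_rows and the variable names flag_a and flag_b are
invented here, and the field positions are simply copied from the
failing command.

```shell
# Sketch only: emit fmgr_builtins rows without using ?: in awk.
# Field numbers ($1, $4, $8, $9, $(NF-1)) mirror the failing command;
# what each column means is not assumed here.
gen_builtin_rows() {
    awk '{
        if ($8 == "t")
            flag_a = "true"
        else
            flag_a = "false"
        if ($4 == "11")
            flag_b = "true"
        else
            flag_b = "false"
        printf("  { %d, \"%s\", %d, %s, %s, %s },\n",
               $1, $(NF-1), $9, flag_a, flag_b, $(NF-1))
    }' "$@"
}
```

It would be called the same way as the original awk command, for
example "gen_builtin_rows fmgr.raw", or with the rows on standard
input.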

regards, tom lane

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: Problems with recent CVS versions and Solaris.

> Ugh. I think that the former version of the script didn't use
> conditional expressions (a ? b : c). Perhaps old versions of
> awk don't have those?

Indeed, the GNU awk manual says so very clearly :-(

Keith, I've committed a new version of Gen_fmgrtab.sh.in;
would you check that it works on your copy of awk?

regards, tom lane

#8 Keith Parks
emkxp01@mtcc.demon.co.uk
In reply to: Tom Lane (#7)
Re: Problems with recent CVS versions and Solaris.

Thanks Tom,

That's fixed it.

It's a shame when you have to "dumb-down" your AWK programming
to suit the lowest common standard :-(

Thanks again,
Keith.

#9 Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#6)
Re: Problems with recent CVS versions and Solaris.

Tom Lane writes:

> Ugh. I think that the former version of the script didn't use
> conditional expressions (a ? b : c). Perhaps old versions of
> awk don't have those? If so we can probably work around it...

While you're at it, you should use AC_PROG_AWK to potentially find the
most modern and fastest awk on the system. Also, it seems that script has
little to no checking of exit statuses. A segfault during initdb is
a really obscure place to find out about awk syntax errors.
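
As a sketch of what that checking could look like: AC_PROG_AWK makes
configure substitute an AWK variable, so the script can call "$AWK"
rather than a bare awk, and the output file is only moved into place if
awk exits successfully. The wrapper name generate_or_fail and the
temporary-file naming are hypothetical, not taken from Gen_fmgrtab.sh.

```shell
# Sketch only: run an awk program and keep its output only on success.
# AWK would normally be substituted by configure (AC_PROG_AWK); falling
# back to plain "awk" here is an assumption for standalone use.
AWK=${AWK:-awk}

generate_or_fail() {
    # $1 = output file, $2 = awk program text, remaining args = inputs
    out=$1
    prog=$2
    shift 2
    tmp="$out.tmp.$$"
    if "$AWK" "$prog" "$@" > "$tmp"; then
        mv "$tmp" "$out"
    else
        status=$?
        rm -f "$tmp"
        echo "generate_or_fail: $AWK exited with status $status" >&2
        return 1
    fi
}
```

With something like this, an awk syntax error would stop the build at
the generation step instead of leaving a table with no real entries.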

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#9)
Re: Problems with recent CVS versions and Solaris.

Peter Eisentraut <peter_e@gmx.net> writes:

> While you're at it, you should use AC_PROG_AWK to potentially find the
> most modern and fastest awk on the system. Also, it seems that script has
> little to no checking of exit statuses.

True. Wanna fix it? I'm not planning to touch it again soon...

regards, tom lane