Recent SIGSEGV failures in buildfarm HEAD

Started by Tom Lane about 19 years ago, 24 messages
#1Tom Lane
tgl@sss.pgh.pa.us

Several of the buildfarm machines are exhibiting repeatable signal 11
crashes in what seem perfectly ordinary queries. This started about
four days ago so I suppose it's got something to do with my
operator-families patch :-( ... but I dunno what, and none of my own
machines show the failure. Can someone provide a stack trace?

regards, tom lane

#2Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Tom Lane (#1)
Re: Recent SIGSEGV failures in buildfarm HEAD

Tom Lane wrote:

Several of the buildfarm machines are exhibiting repeatable signal 11
crashes in what seem perfectly ordinary queries. This started about
four days ago so I suppose it's got something to do with my
operator-families patch :-( ... but I dunno what, and none of my own
machines show the failure. Can someone provide a stack trace?

no stack trace yet however impala at least seems to be running out of
memory (!) with 380MB of RAM and some 800MB of swap(and no other tasks)
during the regression run. Maybe something is causing a dramatic
increase in memory usage that is causing the random failures (in impalas
case the OOM-killer actually decides to terminate the postmaster) ?

Stefan

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stefan Kaltenbrunner (#2)
Re: Recent SIGSEGV failures in buildfarm HEAD

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

Tom Lane wrote:

Several of the buildfarm machines are exhibiting repeatable signal 11
crashes in what seem perfectly ordinary queries.

no stack trace yet however impala at least seems to be running out of
memory (!) with 380MB of RAM and some 800MB of swap(and no other tasks)
during the regression run. Maybe something is causing a dramatic
increase in memory usage that is causing the random failures (in impalas
case the OOM-killer actually decides to terminate the postmaster) ?

No, most all the failures I've looked at are sig11 not sig9.

It is interesting that the failures are not as consistent as I first
thought --- the machines that are showing failures actually fail maybe
one time in two.

regards, tom lane
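
Note on telling the two cases apart: an OOM kill arrives as SIGKILL (9), while a crashing backend dies with SIGSEGV (11). As a minimal sketch, assuming a shell that reports death by signal as exit status 128 plus the signal number (bash and dash do; some ksh versions differ), a hypothetical helper named describe_status could decode a status:

```shell
#!/bin/sh
# Hypothetical helper, not part of the buildfarm script.
# Decode a shell exit status under the "128 + signal number" convention.
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}
```

With this convention, 139 would correspond to sig11 and 137 to sig9.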

#4Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Tom Lane (#3)
Re: Recent SIGSEGV failures in buildfarm HEAD

Tom Lane wrote:

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

Tom Lane wrote:

Several of the buildfarm machines are exhibiting repeatable signal 11
crashes in what seem perfectly ordinary queries.

no stack trace yet however impala at least seems to be running out of
memory (!) with 380MB of RAM and some 800MB of swap(and no other tasks)
during the regression run. Maybe something is causing a dramatic
increase in memory usage that is causing the random failures (in impalas
case the OOM-killer actually decides to terminate the postmaster) ?

No, most all the failures I've looked at are sig11 not sig9.

hmm - still weird and I would not actually consider impala a resource
starved box (especially when compared to other buildfarm-members) so
there seems to be something strange going on.
I have changed the overcommit settings on that box for now - let's see
what the result of that will be.

It is interesting that the failures are not as consistent as I first
thought --- the machines that are showing failures actually fail maybe
one time in two.

or some even less - dove seems to be one of the affected boxes too - I
increased the build frequency since yesterday but it has not yet failed
again ...

Stefan

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stefan Kaltenbrunner (#4)
Re: Recent SIGSEGV failures in buildfarm HEAD

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

Tom Lane wrote:

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

... Maybe something is causing a dramatic
increase in memory usage that is causing the random failures (in impalas
case the OOM-killer actually decides to terminate the postmaster) ?

No, most all the failures I've looked at are sig11 not sig9.

hmm - still weird and I would not actually consider impala a resource
starved box (especially when compared to other buildfarm-members) so
there seems to be something strange going on.

Actually ... one way that a "memory overconsumption" bug could manifest
as sig11 would be if it's a runaway-recursion issue: usually you get sig11
when the machine's stack size limit is exceeded. This doesn't put us
any closer to localizing the problem, but at least it's a guess about
the cause?

I wonder whether there's any way to get the buildfarm script to report a
stack trace automatically if it finds a core file left behind in the
$PGDATA directory after running the tests. Would something like this
be adequately portable?

if [ -f $PGDATA/core* ]
then
echo bt | gdb $installdir/bin/postgres $PGDATA/core*
fi

Obviously it'd fail if no gdb available, but that seems pretty harmless.
The other thing that we'd likely need is an explicit "ulimit -c
unlimited" for machines where core dumps are off by default.

regards, tom lane

#6Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Tom Lane (#5)
Re: Recent SIGSEGV failures in buildfarm HEAD

Tom Lane wrote:

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

Tom Lane wrote:

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

... Maybe something is causing a dramatic
increase in memory usage that is causing the random failures (in impalas
case the OOM-killer actually decides to terminate the postmaster) ?

No, most all the failures I've looked at are sig11 not sig9.

hmm - still weird and I would not actually consider impala a resource
starved box (especially when compared to other buildfarm-members) so
there seems to be something strange going on.

Actually ... one way that a "memory overconsumption" bug could manifest
as sig11 would be if it's a runaway-recursion issue: usually you get sig11
when the machine's stack size limit is exceeded. This doesn't put us
any closer to localizing the problem, but at least it's a guess about
the cause?

that sounds like a possibility though I'm not too optimistic this is
indeed the cause of the problem we see.

I wonder whether there's any way to get the buildfarm script to report a
stack trace automatically if it finds a core file left behind in the
$PGDATA directory after running the tests. Would something like this
be adequately portable?

if [ -f $PGDATA/core* ]
then
echo bt | gdb $installdir/bin/postgres $PGDATA/core*
fi

hmmm - not sure I like that that much

Obviously it'd fail if no gdb available, but that seems pretty harmless.
The other thing that we'd likely need is an explicit "ulimit -c
unlimited" for machines where core dumps are off by default.

there are other issues with that - gdb might be available but not
actually producing reliable results on certain platforms (some
commercial unixes, Windows).

The thing we might want to do is have the buildfarm script override
keep_error_builds=0 conditionally in some cases (like detecting a core).

That way we will at least have a useful buildtree for later
examination (which would otherwise be removed even if we get a one-time
stacktrace and keep_error_builds is disabled)

Stefan

#7Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#5)
Re: Recent SIGSEGV failures in buildfarm HEAD

Tom Lane wrote:

Actually ... one way that a "memory overconsumption" bug could manifest
as sig11 would be if it's a runaway-recursion issue: usually you get sig11
when the machine's stack size limit is exceeded. This doesn't put us
any closer to localizing the problem, but at least it's a guess about
the cause?

I wonder whether there's any way to get the buildfarm script to report a
stack trace automatically if it finds a core file left behind in the
$PGDATA directory after running the tests. Would something like this
be adequately portable?

if [ -f $PGDATA/core* ]
then
echo bt | gdb $installdir/bin/postgres $PGDATA/core*
fi

gdb has a "batch mode" which can be useful:

if [ -f $PGDATA/core* ]
then
gdb -ex "bt" --batch $installdir/bin/postgres $PGDATA/core*
fi

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
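
Note on the snippets above: the test `[ -f $PGDATA/core* ]` only behaves when the glob matches exactly one file; with several core files most shells make `[` fail with an error. A minimal sketch of a loop-based variant follows, reusing the `gdb -ex bt --batch` invocation quoted above. The function names find_cores and report_cores are hypothetical, and file names containing newlines are not handled.

```shell
#!/bin/sh
# Hypothetical helpers, not the buildfarm's actual code.

# Print every regular file named core* in directory $1, one per line.
# Works with zero, one, or several matches.
find_cores() {
    for f in "$1"/core*; do
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Print a backtrace for each core file found in data directory $2,
# using the postgres binary $1. Does nothing if gdb is not installed.
report_cores() {
    command -v gdb >/dev/null 2>&1 || return 0
    find_cores "$2" | while IFS= read -r core; do
        echo "== stack trace: $core =="
        gdb -ex bt --batch "$1" "$core" 2>&1
    done
}
```

A call might look like `report_cores "$installdir/bin/postgres" "$PGDATA"`, using the variables from the snippets above.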

#8Andrew Dunstan
andrew@dunslane.net
In reply to: Alvaro Herrera (#7)
1 attachment(s)
Re: Recent SIGSEGV failures in buildfarm HEAD

Alvaro Herrera wrote:

Tom Lane wrote:

I wonder whether there's any way to get the buildfarm script to report a
stack trace automatically if it finds a core file left behind in the
$PGDATA directory after running the tests. Would something like this
be adequately portable?

if [ -f $PGDATA/core* ]
then
echo bt | gdb $installdir/bin/postgres $PGDATA/core*
fi

gdb has a "batch mode" which can be useful:

if [ -f $PGDATA/core* ]
then
gdb -ex "bt" --batch $installdir/bin/postgres $PGDATA/core*
fi

here's a quick untested patch for buildfarm that Stefan might like to try.

cheers

andrew

Attachments:

btpatch (text/plain; name=btpatch)
--- run_build.pl.orig	2006-12-28 17:32:14.000000000 -0500
+++ run_build.pl.new	2006-12-28 17:58:51.000000000 -0500
@@ -795,6 +795,29 @@
 	$dbstarted=undef;
 }
 
+
+sub get_stack_trace
+{
+	my $bindir = shift;
+	my $pgdata = shift;
+
+	# no core = no result
+	return () unless -f "$pgdata/core";
+
+	# no gdb = no result
+	system "gdb --version > /dev/null 2>&1";
+	my $status = $? >>8;
+	return () if $status; 
+
+	my @trace = `gdb -ex bt --batch $bindir/postgres $pgdata/core 2>&1`;
+
+	unshift(@trace,
+			"\n\n================== stack trace ==================\n");
+
+	return @trace;
+
+}
+
 sub make_install_check
 {
 	my @checkout = `cd $pgsql/src/test/regress && $make installcheck 2>&1`;
@@ -814,6 +837,11 @@
 		}
 		close($handle);	
 	}
+	if ($status)
+	{
+		my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+		push(@checkout,@trace);
+	}
 	writelog('install-check',\@checkout);
 	print "======== make installcheck log ===========\n",@checkout 
 		if ($verbose > 1);
@@ -839,6 +867,11 @@
 		}
 		close($handle);
 	}
+	if ($status)
+	{
+		my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+		push(@checkout,@trace);
+	}
 	writelog('contrib-install-check',\@checkout);
 	print "======== make contrib installcheck log ===========\n",@checkout 
 		if ($verbose > 1);
@@ -864,6 +897,11 @@
 		}
 		close($handle);
 	}
+	if ($status)
+	{
+		my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+		push(@checkout,@trace);
+	}
 	writelog('pl-install-check',\@checkout);
 	print "======== make pl installcheck log ===========\n",@checkout 
 		if ($verbose > 1);
@@ -892,6 +930,13 @@
 		}
 		close($handle);
 	}
+	if ($status)
+	{
+		my @trace = 
+			get_stack_trace("$pgsql/src/test/regress/install$installdir/bin",
+							"$pgsql/src/test/regress/tmp_check/data");
+		push(@makeout,@trace);
+	}
 	writelog('check',\@makeout);
 	print "======== make check logs ===========\n",@makeout 
 		if ($verbose > 1);
#9Alvaro Herrera
alvherre@commandprompt.com
In reply to: Andrew Dunstan (#8)
Re: Recent SIGSEGV failures in buildfarm HEAD

Andrew Dunstan wrote:

here's a quick untested patch for buildfarm that Stefan might like to try.

Note that not all core files are named "core". On some Linux distros,
it's configured to be "core.PID" by default. And you can even change it
to weirder names, but I haven't seen those anywhere by default, so I
guess supporting just the common ones is appropriate.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#9)
Re: Recent SIGSEGV failures in buildfarm HEAD

Alvaro Herrera <alvherre@commandprompt.com> writes:

Andrew Dunstan wrote:

here's a quick untested patch for buildfarm that Stefan might like to try.

Note that not all core files are named "core". On some Linux distros,
it's configured to be "core.PID" by default.

And on some platforms, cores don't drop in the current working directory
... but until we have a problem that *only* manifests on such a
platform, I wouldn't worry about that. We do need to look for 'core*'
not just 'core', though.

Don't forget the ulimit point either ... on most Linuxen there won't be
any core at all without twiddling ulimit.

regards, tom lane

#11Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#10)
1 attachment(s)
Re: Recent SIGSEGV failures in buildfarm HEAD

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Andrew Dunstan wrote:

here's a quick untested patch for buildfarm that Stefan might like to try.

Note that not all core files are named "core". On some Linux distros,
it's configured to be "core.PID" by default.

And on some platforms, cores don't drop in the current working directory
... but until we have a problem that *only* manifests on such a
platform, I wouldn't worry about that. We do need to look for 'core*'
not just 'core', though.

That part is easy enough. And if people mangle their core location I am
certainly not going to go looking for it.

Don't forget the ulimit point either ... on most Linuxen there won't be
any core at all without twiddling ulimit.

Yeah. Perl actually doesn't have a core call for this. I have built some
code (see attached revised patch) to try to do it using a widespread but
non-standard module called BSD::Resource, but if the module is missing
it won't fail.

I'm actually wondering if unlimiting core might not be a useful switch
to provide on pg_ctl, as long as the platform has setrlimit().

cheers

andrew

Attachments:

btpatch (text/plain; name=btpatch)
--- run_build.pl.orig	2006-12-28 17:32:14.000000000 -0500
+++ run_build.pl.new	2006-12-29 10:59:39.000000000 -0500
@@ -299,6 +299,20 @@
 	unlink $forcefile;
 }
 
+# try to allow core files to be produced.
+# another way would be for the calling environment
+# to call ulimit. We do this in an eval so failure is
+# not fatal.
+eval
+{
+	require BSD::Resource;
+	BSD::Resource->import();
+	# explicit sub calls here using & keeps compiler happy
+	my $coreok = setrlimit(&RLIMIT_CORE,&RLIM_INFINITY,&RLIM_INFINITY);
+	die "setrlimit" unless $coreok;
+};
+warn "failed to unlimit core size: $@" if $@;
+
 # the time we take the snapshot
 my $now=time;
 my $installdir = "$buildroot/$branch/inst";
@@ -795,6 +809,34 @@
 	$dbstarted=undef;
 }
 
+
+sub get_stack_trace
+{
+	my $bindir = shift;
+	my $pgdata = shift;
+
+	# no core = no result
+	my @cores = glob("$pgdata/core*");
+	return () unless @cores;
+
+	# no gdb = no result
+	system "gdb --version > /dev/null 2>&1";
+	my $status = $? >>8;
+	return () if $status; 
+
+	my @trace;
+
+	foreach my $core (@cores)
+	{
+		my @onetrace = `gdb -ex bt --batch $bindir/postgres $core 2>&1`;
+		push(@trace,
+			"\n\n================== stack trace: $core ==================\n",
+			 @onetrace);
+	}
+
+	return @trace;
+}
+
 sub make_install_check
 {
 	my @checkout = `cd $pgsql/src/test/regress && $make installcheck 2>&1`;
@@ -814,6 +856,11 @@
 		}
 		close($handle);	
 	}
+	if ($status)
+	{
+		my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+		push(@checkout,@trace);
+	}
 	writelog('install-check',\@checkout);
 	print "======== make installcheck log ===========\n",@checkout 
 		if ($verbose > 1);
@@ -839,6 +886,11 @@
 		}
 		close($handle);
 	}
+	if ($status)
+	{
+		my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+		push(@checkout,@trace);
+	}
 	writelog('contrib-install-check',\@checkout);
 	print "======== make contrib installcheck log ===========\n",@checkout 
 		if ($verbose > 1);
@@ -864,6 +916,11 @@
 		}
 		close($handle);
 	}
+	if ($status)
+	{
+		my @trace = get_stack_trace("$installdir/bin","$installdir/data");
+		push(@checkout,@trace);
+	}
 	writelog('pl-install-check',\@checkout);
 	print "======== make pl installcheck log ===========\n",@checkout 
 		if ($verbose > 1);
@@ -892,6 +949,13 @@
 		}
 		close($handle);
 	}
+	if ($status)
+	{
+		my @trace = 
+			get_stack_trace("$pgsql/src/test/regress/install$installdir/bin",
+							"$pgsql/src/test/regress/tmp_check/data");
+		push(@makeout,@trace);
+	}
 	writelog('check',\@makeout);
 	print "======== make check logs ===========\n",@makeout 
 		if ($verbose > 1);
#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#11)
Re: Recent SIGSEGV failures in buildfarm HEAD

Andrew Dunstan <andrew@dunslane.net> writes:

I'm actually wondering if unlimiting core might not be a useful switch
to provide on pg_ctl, as long as the platform has setrlimit().

Not a bad thought; that's actually one of the reasons that I still
usually use a handmade script rather than pg_ctl for launching
postmasters ...

regards, tom lane

#13Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Tom Lane (#12)
Re: Recent SIGSEGV failures in buildfarm HEAD

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I'm actually wondering if unlimiting core might not be a useful switch
to provide on pg_ctl, as long as the platform has setrlimit().

Not a bad thought; that's actually one of the reasons that I still
usually use a handmade script rather than pg_ctl for launching
postmasters ...

this sounds like a good idea to me too - it seems like a cleaner and
more generally useful thing than just doing it in the buildfarm
code ...

Stefan

#14Andrew Dunstan
andrew@dunslane.net
In reply to: Stefan Kaltenbrunner (#13)
1 attachment(s)
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD

Stefan Kaltenbrunner wrote:

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I'm actually wondering if unlimiting core might not be a useful switch
to provide on pg_ctl, as long as the platform has setrlimit().

Not a bad thought; that's actually one of the reasons that I still
usually use a handmade script rather than pg_ctl for launching
postmasters ...

this sounds like a good idea to me too - it seems like a cleaner and
more generally useful thing than just doing it in the buildfarm
code ...

Draft patch attached. However, there will be some more work to do. For
one thing, pg_regress does not use pg_ctl to start its temp install
postmaster, so either we'll need to train it the same way or get it to
use pg_ctl. And then we'd need to change the regression makefile to use
the option, based on an environment variable a bit like MAX_CONNECTIONS
maybe.

cheers

andrew

Attachments:

ctlpatch (text/plain; name=ctlpatch)
Index: src/bin/pg_ctl/pg_ctl.c
===================================================================
RCS file: /cvsroot/pgsql/src/bin/pg_ctl/pg_ctl.c,v
retrieving revision 1.74
diff -c -r1.74 pg_ctl.c
*** src/bin/pg_ctl/pg_ctl.c	12 Oct 2006 05:14:49 -0000	1.74
--- src/bin/pg_ctl/pg_ctl.c	29 Dec 2006 21:08:39 -0000
***************
*** 26,31 ****
--- 26,36 ----
  #include <sys/stat.h>
  #include <unistd.h>
  
+ #ifdef HAVE_SYS_RESOURCE_H
+ #include <sys/time.h>
+ #include <sys/resource.h>
+ #endif
+ 
  #include "libpq/pqsignal.h"
  #include "getopt_long.h"
  
***************
*** 90,95 ****
--- 95,103 ----
  static char *register_username = NULL;
  static char *register_password = NULL;
  static char *argv0 = NULL;
+ #if HAVE_GETRLIMIT
+ static bool allow_core_files = false;
+ #endif
  
  static void
  write_stderr(const char *fmt,...)
***************
*** 132,137 ****
--- 140,149 ----
  static char pid_file[MAXPGPATH];
  static char conf_file[MAXPGPATH];
  
+ #if HAVE_GETRLIMIT
+ static void unlimit_core_size(void);
+ #endif
+ 
  
  #if defined(WIN32) || defined(__CYGWIN__)
  static void
***************
*** 478,483 ****
--- 490,516 ----
  }
  
  
+ #if HAVE_GETRLIMIT
+ static void 
+ unlimit_core_size(void)
+ {
+ 	struct rlimit lim;
+ 	getrlimit(RLIMIT_CORE,&lim);
+ 	if (lim.rlim_max == 0)
+ 	{
+ 			write_stderr(_("%s: cannot set core size,: disallowed by hard limit.\n"), 
+ 						 progname);
+ 			return;
+ 	}
+ 	else if (lim.rlim_max == RLIM_INFINITY || lim.rlim_cur < lim.rlim_max)
+ 	{
+ 		lim.rlim_cur = lim.rlim_max;
+ 		setrlimit(RLIMIT_CORE,&lim);
+ 	}	
+ }
+ #endif
+ 
+ 
  
  static void
  do_start(void)
***************
*** 581,586 ****
--- 614,624 ----
  		postgres_path = postmaster_path;
  	}
  
+ #if HAVE_GETRLIMIT
+ 	if (allow_core_files)
+ 		unlimit_core_size();
+ #endif
+ 
  	exitcode = start_postmaster();
  	if (exitcode != 0)
  	{
***************
*** 1401,1406 ****
--- 1439,1447 ----
  	printf(_("  -o OPTIONS             command line options to pass to postgres\n"
  			 "                         (PostgreSQL server executable)\n"));
  	printf(_("  -p PATH-TO-POSTGRES    normally not necessary\n"));
+ #if HAVE_GETRLIMIT
+ 	printf(_("  -c, --corefiles        allow postgres to produce core files\n"));
+ #endif
  
  	printf(_("\nOptions for stop or restart:\n"));
  	printf(_("  -m SHUTDOWN-MODE   may be \"smart\", \"fast\", or \"immediate\"\n"));
***************
*** 1497,1502 ****
--- 1538,1546 ----
  		{"mode", required_argument, NULL, 'm'},
  		{"pgdata", required_argument, NULL, 'D'},
  		{"silent", no_argument, NULL, 's'},
+ #if HAVE_GETRLIMIT
+ 		{"corefiles", no_argument, NULL, 'c'},
+ #endif
  		{NULL, 0, NULL, 0}
  	};
  
***************
*** 1561,1567 ****
  	/* process command-line options */
  	while (optind < argc)
  	{
! 		while ((c = getopt_long(argc, argv, "D:l:m:N:o:p:P:sU:wW", long_options, &option_index)) != -1)
  		{
  			switch (c)
  			{
--- 1605,1611 ----
  	/* process command-line options */
  	while (optind < argc)
  	{
! 		while ((c = getopt_long(argc, argv, "cD:l:m:N:o:p:P:sU:wW", long_options, &option_index)) != -1)
  		{
  			switch (c)
  			{
***************
*** 1632,1637 ****
--- 1676,1686 ----
  					do_wait = false;
  					wait_set = true;
  					break;
+ #if HAVE_GETRLIMIT
+ 				case 'c':
+ 					allow_core_files = true;
+ 					break;
+ #endif
  				default:
  					/* getopt_long already issued a suitable error message */
  					do_advice();
#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#14)
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD

Andrew Dunstan <andrew@dunslane.net> writes:

... And then we'd need to change the regression makefile to use
the option, based on an environment variable a bit like MAX_CONNECTIONS
maybe.

Why wouldn't we just use it always? If a regression test dumps core,
that's going to deserve investigation.

regards, tom lane

#16Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Tom Lane (#15)
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

... And then we'd need to change the regression makefile to use
the option, based on an environment variable a bit like MAX_CONNECTIONS
maybe.

Why wouldn't we just use it always? If a regression test dumps core,
that's going to deserve investigation.

enabling it always for the regression tests probably makes sense - but
there is also the possibility that such a core can get very large and
potentially run the partition the regression test runs on out of space.

Stefan

#17Andrew Dunstan
andrew@dunslane.net
In reply to: Stefan Kaltenbrunner (#16)
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD

Stefan Kaltenbrunner wrote:

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

... And then we'd need to change the regression makefile to use
the option, based on an environment variable a bit like
MAX_CONNECTIONS
maybe.

Why wouldn't we just use it always? If a regression test dumps core,
that's going to deserve investigation.

enabling it always for the regression tests probably makes sense - but
there is also the possibility that such a core can get very large and
potentially run the partition the regression test runs on out of space.

I think Tom is right. You can always set the hard limit before calling
"make check" or running the buildfarm script. I'll prepare a patch to use
similar code unconditionally in pg_regress.

cheers

andrew
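
Note on capping core size from the calling environment: the suggestion above is that whoever runs the tests can set the limit before "make check". A minimal sketch, assuming a shell whose `ulimit` builtin accepts `-c` (bash and dash do; the unit of the value is shell-dependent) and using a hypothetical wrapper name run_with_core_limit:

```shell
#!/bin/sh
# Hypothetical wrapper, not part of pg_regress or the buildfarm script.
# Run a command in a subshell with the core-size limit set to $1.
# Without -H or -S, ulimit sets both the hard and the soft limit, so the
# child processes cannot raise it again beyond this value.
run_with_core_limit() {
    limit=$1
    shift
    (
        ulimit -c "$limit" || exit 1
        "$@"
    )
}
```

For example, `run_with_core_limit 200000 make check` would bound the size of any core produced during the run, provided the current hard limit is at least that large. The limit only applies inside the subshell, so the calling shell is left unchanged.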

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Recent SIGSEGV failures in buildfarm HEAD

Seneca Cunningham <tentra@gmail.com> writes:

I don't have a core, but here's the CrashReporter output for both
of jackal's failed runs:

Wow, some actual data, rather than just noodling about how to get it ...
thanks!

...
11 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
12 postgres 0x00020868 relation_open + 84 (heapam.c:697)
13 postgres 0x0002aab9 index_open + 32 (indexam.c:140)
14 postgres 0x0002a9d4 systable_beginscan + 289 (genam.c:184)
15 postgres 0x002279e4 RelationInitIndexAccessInfo + 1645 (relcache.c:1200)
16 postgres 0x0022926a RelationBuildDesc + 3527 (relcache.c:866)
17 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
18 postgres 0x00020868 relation_open + 84 (heapam.c:697)
19 postgres 0x0002aab9 index_open + 32 (indexam.c:140)
20 postgres 0x0002a9d4 systable_beginscan + 289 (genam.c:184)
21 postgres 0x002279e4 RelationInitIndexAccessInfo + 1645 (relcache.c:1200)
22 postgres 0x0022926a RelationBuildDesc + 3527 (relcache.c:866)
23 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
...

What you seem to have here is infinite recursion during relcache
initialization. That's surely not hard to believe, considering I just
whacked that code around, and indeed changed some of the tests that are
intended to prevent such recursion. But what I don't understand is why
it'd be platform-specific, much less not perfectly repeatable on the
platforms where it does manifest. Anyone have a clue?

regards, tom lane

#19Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Tom Lane (#18)
Re: Recent SIGSEGV failures in buildfarm HEAD

Tom Lane wrote:

Seneca Cunningham <tentra@gmail.com> writes:

I don't have a core, but here's the CrashReporter output for both
of jackal's failed runs:

Wow, some actual data, rather than just noodling about how to get it ...
thanks!

...
11 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
12 postgres 0x00020868 relation_open + 84 (heapam.c:697)
13 postgres 0x0002aab9 index_open + 32 (indexam.c:140)
14 postgres 0x0002a9d4 systable_beginscan + 289 (genam.c:184)
15 postgres 0x002279e4 RelationInitIndexAccessInfo + 1645 (relcache.c:1200)
16 postgres 0x0022926a RelationBuildDesc + 3527 (relcache.c:866)
17 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
18 postgres 0x00020868 relation_open + 84 (heapam.c:697)
19 postgres 0x0002aab9 index_open + 32 (indexam.c:140)
20 postgres 0x0002a9d4 systable_beginscan + 289 (genam.c:184)
21 postgres 0x002279e4 RelationInitIndexAccessInfo + 1645 (relcache.c:1200)
22 postgres 0x0022926a RelationBuildDesc + 3527 (relcache.c:866)
23 postgres 0x0022b2e3 RelationIdGetRelation + 110 (relcache.c:1496)
...

What you seem to have here is infinite recursion during relcache
initialization. That's surely not hard to believe, considering I just
whacked that code around, and indeed changed some of the tests that are
intended to prevent such recursion. But what I don't understand is why
it'd be platform-specific, much less not perfectly repeatable on the
platforms where it does manifest. Anyone have a clue?

fwiw - I can trigger that issue now pretty reliably on a fast Opteron
box (running Debian Sarge/AMD64) with make regress in a loop - I seem to
be able to trigger it in about 20-25% of the runs.
the resulting core however looks totally stack corrupted and not really
usable :-(

Stefan

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stefan Kaltenbrunner (#19)
Re: Recent SIGSEGV failures in buildfarm HEAD

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:

fwiw - I can trigger that issue now pretty reliably on a fast Opteron
box (running Debian Sarge/AMD64) with make regress in a loop - I seem to
be able to trigger it in about 20-25% of the runs.
the resulting core however looks totally stack corrupted and not really
usable :-(

Hmm, probably the stack overrun leaves the call stack too corrupt for
gdb to make sense of. Try inserting "check_stack_depth();" into one
of the functions that're part of the infinite recursion, and then make
check_stack_depth() do an abort() instead of just elog(ERROR). That
might give you a core that gdb can work with.

I'm still having absolutely 0 success reproducing it on a dual Xeon
... so it's not just the architecture that's the issue. Some kind of
timing problem? That's hard to believe too.

regards, tom lane

#21Seneca Cunningham
tentra@gmail.com
In reply to: Stefan Kaltenbrunner (#19)
Re: Recent SIGSEGV failures in buildfarm HEAD

On Sun, Dec 31, 2006 at 05:43:45PM +0100, Stefan Kaltenbrunner wrote:

Tom Lane wrote:

What you seem to have here is infinite recursion during relcache
initialization. That's surely not hard to believe, considering I just
whacked that code around, and indeed changed some of the tests that are
intended to prevent such recursion. But what I don't understand is why
it'd be platform-specific, much less not perfectly repeatable on the
platforms where it does manifest. Anyone have a clue?

fwiw - I can trigger that issue now pretty reliably on a fast Opteron
box (running Debian Sarge/AMD64) with make regress in a loop - I seem to
be able to trigger it in about 20-25% of the runs.
the resulting core however looks totally stack corrupted and not really
usable :-(

By reducing the stack size on jackal from the default of 8MB to 3MB, I
can get this to trigger in roughly 30% of the runs while preserving the
passed tests in the other parallel groups.

--
Seneca
tentra@gmail.com
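
Note on reproducing an intermittent failure: both reports above amount to running the tests repeatedly and counting how often they fail, optionally with a smaller stack. A minimal sketch, with a hypothetical function name count_failures, assuming `ulimit -s` takes a value in kilobytes as it does in bash:

```shell
#!/bin/sh
# Hypothetical helper for estimating a failure rate.
# Run a command $1 times and print how many runs exited non-zero.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

Something like `( ulimit -s 3072; count_failures 20 make check )` would approximate the 3MB experiment; the subshell keeps the reduced stack limit from leaking into the calling shell. Whether the limit reaches the server processes depends on how the test postmaster is started.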

#22Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#15)
1 attachment(s)
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

... And then we'd need to change the regression makefile to use
the option, based on an environment variable a bit like MAX_CONNECTIONS
maybe.

Why wouldn't we just use it always? If a regression test dumps core,
that's going to deserve investigation.

Revised patch attached, doing just this. I will apply it soon unless
there are objections.

cheers

andrew

Attachments:

corepatch (text/plain; name=corepatch)
Index: doc/src/sgml/ref/pg_ctl-ref.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/ref/pg_ctl-ref.sgml,v
retrieving revision 1.35
diff -c -r1.35 pg_ctl-ref.sgml
*** doc/src/sgml/ref/pg_ctl-ref.sgml	2 Dec 2006 00:34:52 -0000	1.35
--- doc/src/sgml/ref/pg_ctl-ref.sgml	2 Jan 2007 20:25:01 -0000
***************
*** 29,34 ****
--- 29,35 ----
     <arg>-l <replaceable>filename</replaceable></arg>
     <arg>-o <replaceable>options</replaceable></arg>
     <arg>-p <replaceable>path</replaceable></arg>
+    <arg>-c</arg>
     <sbr>
     <command>pg_ctl</command>
     <arg choice="plain">stop</arg>
***************
*** 48,53 ****
--- 49,55 ----
     <arg>-w</arg>
     <arg>-s</arg>
     <arg>-D <replaceable>datadir</replaceable></arg>
+    <arg>-c</arg>
     <arg>-m
       <group choice="plain">
         <arg>s[mart]</arg>
***************
*** 246,251 ****
--- 248,266 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>-c</option></term>
+       <listitem>
+        <para>
+         Attempt to allow server crashes to produce core files, on platforms
+         where this available, by lifting any soft resource limit placed on 
+ 		them. 
+ 		This is useful in debugging or diagnosing problems by allowing a 
+ 		stack trace to be obtained from a failed server process.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry>
        <term><option>-w</option></term>
        <listitem>
         <para>
Index: src/bin/pg_ctl/pg_ctl.c
===================================================================
RCS file: /cvsroot/pgsql/src/bin/pg_ctl/pg_ctl.c,v
retrieving revision 1.74
diff -c -r1.74 pg_ctl.c
*** src/bin/pg_ctl/pg_ctl.c	12 Oct 2006 05:14:49 -0000	1.74
--- src/bin/pg_ctl/pg_ctl.c	2 Jan 2007 20:25:02 -0000
***************
*** 26,31 ****
--- 26,36 ----
  #include <sys/stat.h>
  #include <unistd.h>
  
+ #ifdef HAVE_SYS_RESOURCE_H
+ #include <sys/time.h>
+ #include <sys/resource.h>
+ #endif
+ 
  #include "libpq/pqsignal.h"
  #include "getopt_long.h"
  
***************
*** 90,95 ****
--- 95,103 ----
  static char *register_username = NULL;
  static char *register_password = NULL;
  static char *argv0 = NULL;
+ #if HAVE_GETRLIMIT
+ static bool allow_core_files = false;
+ #endif
  
  static void
  write_stderr(const char *fmt,...)
***************
*** 132,137 ****
--- 140,149 ----
  static char pid_file[MAXPGPATH];
  static char conf_file[MAXPGPATH];
  
+ #if HAVE_GETRLIMIT
+ static void unlimit_core_size(void);
+ #endif
+ 
  
  #if defined(WIN32) || defined(__CYGWIN__)
  static void
***************
*** 478,483 ****
--- 490,516 ----
  }
  
  
+ #if HAVE_GETRLIMIT
+ static void 
+ unlimit_core_size(void)
+ {
+ 	struct rlimit lim;
+ 	getrlimit(RLIMIT_CORE,&lim);
+ 	if (lim.rlim_max == 0)
+ 	{
+ 			write_stderr(_("%s: cannot set core size,: disallowed by hard limit.\n"), 
+ 						 progname);
+ 			return;
+ 	}
+ 	else if (lim.rlim_max == RLIM_INFINITY || lim.rlim_cur < lim.rlim_max)
+ 	{
+ 		lim.rlim_cur = lim.rlim_max;
+ 		setrlimit(RLIMIT_CORE,&lim);
+ 	}	
+ }
+ #endif
+ 
+ 
  
  static void
  do_start(void)
***************
*** 581,586 ****
--- 614,624 ----
  		postgres_path = postmaster_path;
  	}
  
+ #if HAVE_GETRLIMIT
+ 	if (allow_core_files)
+ 		unlimit_core_size();
+ #endif
+ 
  	exitcode = start_postmaster();
  	if (exitcode != 0)
  	{
***************
*** 1401,1406 ****
--- 1439,1447 ----
  	printf(_("  -o OPTIONS             command line options to pass to postgres\n"
  			 "                         (PostgreSQL server executable)\n"));
  	printf(_("  -p PATH-TO-POSTGRES    normally not necessary\n"));
+ #if HAVE_GETRLIMIT
+ 	printf(_("  -c, --corefiles        allow postgres to produce core files\n"));
+ #endif
  
  	printf(_("\nOptions for stop or restart:\n"));
  	printf(_("  -m SHUTDOWN-MODE   may be \"smart\", \"fast\", or \"immediate\"\n"));
***************
*** 1497,1502 ****
--- 1538,1546 ----
  		{"mode", required_argument, NULL, 'm'},
  		{"pgdata", required_argument, NULL, 'D'},
  		{"silent", no_argument, NULL, 's'},
+ #if HAVE_GETRLIMIT
+ 		{"corefiles", no_argument, NULL, 'c'},
+ #endif
  		{NULL, 0, NULL, 0}
  	};
  
***************
*** 1561,1567 ****
  	/* process command-line options */
  	while (optind < argc)
  	{
! 		while ((c = getopt_long(argc, argv, "D:l:m:N:o:p:P:sU:wW", long_options, &option_index)) != -1)
  		{
  			switch (c)
  			{
--- 1605,1611 ----
  	/* process command-line options */
  	while (optind < argc)
  	{
! 		while ((c = getopt_long(argc, argv, "cD:l:m:N:o:p:P:sU:wW", long_options, &option_index)) != -1)
  		{
  			switch (c)
  			{
***************
*** 1632,1637 ****
--- 1676,1686 ----
  					do_wait = false;
  					wait_set = true;
  					break;
+ #if HAVE_GETRLIMIT
+ 				case 'c':
+ 					allow_core_files = true;
+ 					break;
+ #endif
  				default:
  					/* getopt_long already issued a suitable error message */
  					do_advice();
Index: src/test/regress/pg_regress.c
===================================================================
RCS file: /cvsroot/pgsql/src/test/regress/pg_regress.c,v
retrieving revision 1.23
diff -c -r1.23 pg_regress.c
*** src/test/regress/pg_regress.c	4 Oct 2006 00:30:14 -0000	1.23
--- src/test/regress/pg_regress.c	2 Jan 2007 20:25:03 -0000
***************
*** 24,29 ****
--- 24,34 ----
  #include <signal.h>
  #include <unistd.h>
  
+ #ifdef HAVE_SYS_RESOURCE_H
+ #include <sys/time.h>
+ #include <sys/resource.h>
+ #endif
+ 
  #include "getopt_long.h"
  #include "pg_config_paths.h"
  
***************
*** 122,127 ****
--- 127,156 ----
     the supplied arguments. */
  __attribute__((format(printf, 2, 3)));
  
+ /*
+  * allow core files if possible.
+  */
+ #if HAVE_GETRLIMIT
+ static void 
+ unlimit_core_size(void)
+ {
+ 	struct rlimit lim;
+ 	getrlimit(RLIMIT_CORE,&lim);
+ 	if (lim.rlim_max == 0)
+ 	{
+ 		fprintf(stderr,
+ 				_("%s: cannot set core size,: disallowed by hard limit.\n"), 
+ 				progname);
+ 		return;
+ 	}
+ 	else if (lim.rlim_max == RLIM_INFINITY || lim.rlim_cur < lim.rlim_max)
+ 	{
+ 		lim.rlim_cur = lim.rlim_max;
+ 		setrlimit(RLIMIT_CORE,&lim);
+ 	}	
+ }
+ #endif
+ 
  
  /*
   * Add an item at the end of a stringlist.
***************
*** 1459,1464 ****
--- 1488,1497 ----
  
  	initialize_environment();
  
+ #if HAVE_GETRLIMIT
+ 	unlimit_core_size();
+ #endif
+ 
  	if (temp_install)
  	{
  		/*
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#22)
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD

Andrew Dunstan <andrew@dunslane.net> writes:

Revised patch attached, doing just this. I will apply it soon unless
there are objections.

Probably a good idea to check defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE),
rather than naively assuming every getrlimit implementation supports
that particular setting. Also, should the -c option exist but just not
do anything if the platform doesn't support it? As is, you're making it
impossible to just specify -c without worrying if it does anything.
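
A minimal sketch of the guard suggested above, assuming a POSIX system where <sys/resource.h> exists. It is not the revised patch: HAVE_GETRLIMIT is defined by hand here only so the fragment compiles on its own (normally configure would supply it), and the core_result enum is a hypothetical addition for illustration. The function always exists, so a caller could accept -c on every platform and simply get a no-op where the limit cannot be changed.

```c
#include <sys/time.h>
#include <sys/resource.h>

/*
 * Assumption: in the real tree this symbol comes from configure.
 * It is defined here only to keep the sketch self-contained.
 */
#ifndef HAVE_GETRLIMIT
#define HAVE_GETRLIMIT 1
#endif

/* Hypothetical result codes, not part of the posted patch. */
typedef enum
{
	CORE_UNSUPPORTED,			/* no getrlimit or no RLIMIT_CORE: no-op */
	CORE_DISALLOWED,			/* hard limit is zero, nothing to raise */
	CORE_ERROR,					/* getrlimit or setrlimit failed */
	CORE_OK						/* soft limit now equals the hard limit */
} core_result;

static core_result
unlimit_core_size(void)
{
#if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
	struct rlimit lim;

	if (getrlimit(RLIMIT_CORE, &lim) != 0)
		return CORE_ERROR;

	if (lim.rlim_max == 0)
		return CORE_DISALLOWED;

	/* Same condition as the posted patch: raise soft limit to hard limit. */
	if (lim.rlim_max == RLIM_INFINITY || lim.rlim_cur < lim.rlim_max)
	{
		lim.rlim_cur = lim.rlim_max;
		if (setrlimit(RLIMIT_CORE, &lim) != 0)
			return CORE_ERROR;
	}
	return CORE_OK;
#else
	/* Platform lacks the feature; the switch is accepted but does nothing. */
	return CORE_UNSUPPORTED;
#endif
}
```

With this shape the 'c' case in the option switch and the long_options entry would need no #if of their own; only the body of the function is conditional.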

The documentation fails to list the long form of the switch
(--corefiles, which should probably really be --core-files for consistency).
There's a typo in this message, too:

+ _("%s: cannot set core size,: disallowed by hard limit.\n"),

regards, tom lane

#24Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#23)
Re: [HACKERS] Recent SIGSEGV failures in buildfarm HEAD

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Revised patch attached, doing just this. I will apply it soon unless
there are objections.

Probably a good idea to check defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE),
rather than naively assuming every getrlimit implementation supports
that particular setting. Also, should the -c option exist but just not
do anything if the platform doesn't support it? As is, you're making it
impossible to just specify -c without worrying if it does anything.

The documentation fails to list the long form of the switch
(--corefiles, which should probably really be --core-files for consistency).
There's a typo in this message, too:

+ _("%s: cannot set core size,: disallowed by hard limit.\n"),

OK, I'll fix all this. Thanks.

cheers

andrew