Intermittent stats test failures on buildfarm

Started by Tom Lane over 20 years ago (6 messages)
#1 Tom Lane
tgl@sss.pgh.pa.us

I just spent a tedious hour digging through the buildfarm results
to see what I could learn about the intermittent failures we're seeing
in the stats regression test, such as here:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=ferret&dt=2005-05-29%2018:25:09
This is seen in both Check and InstallCheck steps. A variant pathology
is seen here:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=gerbil&dt=2005-07-22%2007:58:01
Notice that only the heap stats columns are wrong in this case, not the
index stats. I think that this variant behavior may have been fixed by
this patch:

2005-07-23 20:33 tgl

* src/backend/postmaster/pgstat.c: Fix some failures to initialize
table entries induced by recent autovacuum integration. Not clear
this explains recent stats problems, but it's definitely wrong.

but it's not certain since nobody traced through the code to exhibit
why those uninitialized table entries would have led to this particular
visible symptom. But with no occurrences of that behavior since the
patch went in, I suspect it's fixed.

What we are left with turns out to be multiple occurrences of the first
pathology on exactly three buildfarm members:

ferret      Cygwin
kudu        Solaris 9, x86
dragonfly   Solaris 9, x86

There are no occurrences of the failure on the native-Windows machines,
nor on buzzard (Solaris 10, SPARC), nor on gerbil (Solaris 9, SPARC)
(though gerbil has one old occurrence of the second pathology, so maybe
that observation should be taken with a grain of salt). And none
whatever on any other buildfarm member.

The same three machines are showing the failure in the 8.0 branch, too,
so it's not a recently-introduced issue.

And one thing more: kudu and dragonfly are actually the same machine,
same OS, different compilers.

So what to make of this? Dunno, but it is clearly a very
platform-specific behavior. Anyone see a connection between Cygwin
and Solaris?

regards, tom lane

#2 Kris Jurka
books@ejurka.com
In reply to: Tom Lane (#1)
Re: Intermittent stats test failures on buildfarm

On Tue, 30 Aug 2005, Tom Lane wrote:

> What we are left with turns out to be multiple occurrences of the first
> pathology on exactly three buildfarm members:
>
> ferret      Cygwin
> kudu        Solaris 9, x86
> dragonfly   Solaris 9, x86
>
> So what to make of this? Dunno, but it is clearly a very
> platform-specific behavior. Anyone see a connection between Cygwin
> and Solaris?

One thing to note about kudu and dragonfly is that they are running under
VMware. This, combined with Cygwin's reputation, makes me suspect that
the connection is that they are both struggling under load. That said,
canary (NetBSD 1.6 x86) is set up in the same fashion and has shown no such
failures.

I'm also in the process of moving, so I put this machine in a box last
night and it won't be up and running for a week or two. I do have very
similar copies of the OS image running on other machines if you'd like me
to test something specific.

Kris Jurka

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kris Jurka (#2)
Re: Intermittent stats test failures on buildfarm

Kris Jurka <books@ejurka.com> writes:

> On Tue, 30 Aug 2005, Tom Lane wrote:
>
>> What we are left with turns out to be multiple occurrences of the first
>> pathology on exactly three buildfarm members:
>>
>> ferret      Cygwin
>> kudu        Solaris 9, x86
>> dragonfly   Solaris 9, x86
>>
>> So what to make of this? Dunno, but it is clearly a very
>> platform-specific behavior. Anyone see a connection between Cygwin
>> and Solaris?
>
> One thing to note about kudu and dragonfly is that they are running under
> vmware. This, combined with cygwin's reputation, makes me suspect that
> the connection is that they are both struggling under load. Although
> canary (NetBSD 1.6 x86) is setup in the same fashion and has shown no such
> failures.

Hmm. One pretty obvious explanation of the failure is simply that the
machine is so loaded that the stats collector doesn't get to run for a
few seconds. I had dismissed this idea because I figured the buildfarm
machine owners would schedule the tests to run at relatively low-load
times of day ... but maybe that's not true on these two machines?

We could try increasing the delay in the stats test, say from two
seconds to five. If it is just a matter of load, that should result
in a very large drop in the frequency of the failure.
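
The load hypothesis above can be sketched as a small model: the test fails
whenever the stats collector lags by more than the fixed delay the test
waits. This is only an illustration of the reasoning, not code from the
regression suite. The function names and the lag values are hypothetical,
not measured buildfarm data.

```python
from typing import Iterable


def stats_test_fails(collector_lag_s: float, test_delay_s: float) -> bool:
    """Model: the test reads the counters after a fixed delay and sees
    stale values if the collector has not caught up by then."""
    return collector_lag_s > test_delay_s


def failure_rate(lags_s: Iterable[float], test_delay_s: float) -> float:
    """Fraction of runs in which the collector lag exceeds the delay."""
    lags = list(lags_s)
    if not lags:
        return 0.0
    failures = sum(stats_test_fails(lag, test_delay_s) for lag in lags)
    return failures / len(lags)


# Hypothetical collector lags in seconds, for illustration only.
hypothetical_lags = [0.1, 0.3, 0.5, 1.0, 2.5, 3.0, 0.2, 4.0, 0.4, 6.0]

print(failure_rate(hypothetical_lags, 2.0))  # current two-second delay
print(failure_rate(hypothetical_lags, 5.0))  # proposed five-second delay
```

If the failures persist at roughly the same rate after the delay is raised,
load alone is probably not the explanation.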

regards, tom lane

#4 Rocco Altier
RoccoA@Routescape.com
In reply to: Tom Lane (#3)
Re: Intermittent stats test failures on buildfarm

Also, kookaburra (AIX) has a problem with the stats test as well.

What is most puzzling to me is that it only happens with cc (not gcc).
And I can only get it to happen when the buildfarm runs from a cron job.
If I run it interactively, or run the build script from the command line,
the stats collector runs fine.

The environments under cron and at the command line are not significantly
different, so I am at a bit of a loss as to the reason why.
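
One way to check the cron-versus-command-line comparison above is to dump
`env` in both contexts and diff the two dumps mechanically. The sketch below
assumes each dump is plain `NAME=value` lines. The helper names and the
sample values are hypothetical, and it does not cover differences that live
outside environment variables (resource limits, controlling terminal, and so
on).

```python
from typing import Dict, List


def parse_env_dump(text: str) -> Dict[str, str]:
    """Parse `env`-style output (one NAME=value per line) into a dict."""
    env: Dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            env[name] = value
    return env


def diff_env(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, List[str]]:
    """Report variables present on only one side, or with differing values."""
    return {
        "only_a": sorted(set(a) - set(b)),
        "only_b": sorted(set(b) - set(a)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }


# Hypothetical dumps, for illustration only.
interactive = parse_env_dump(
    "PATH=/usr/bin:/usr/local/bin\nHOME=/home/pgbuild\nLANG=en_US\n"
)
cron = parse_env_dump("PATH=/usr/bin\nHOME=/home/pgbuild\nSHELL=/bin/sh\n")

print(diff_env(interactive, cron))
```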

Any thoughts?

-rocco

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Rocco Altier (#4)
Re: Intermittent stats test failures on buildfarm

"Rocco Altier" <RoccoA@Routescape.com> writes:

> Also, kookaburra (AIX) has a problem with the stats test as well.

kookaburra's problem is entirely different, not intermittent in the
least. The error diff shows that stats collection is off, and its
postmaster log says

LOG: could not bind socket for statistics collector: Permission denied
LOG: disabling statistics collector for lack of working socket

I have no idea what's causing that --- the only reason I know of for
EACCES from bind() is trying to bind to a privileged port number, and
one hopes we're not doing that.
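
The bind() behavior mentioned above can be probed with a short sketch. It
tries to bind a UDP socket and reports the errno. This is a generic
illustration, not the collector's actual socket setup, and the helper names
are hypothetical. Port 0 asks the kernel for any free unprivileged port. A
port below 1024 is privileged on traditional Unix systems, so binding it
without sufficient privilege typically fails with EACCES, though the result
depends on the platform and on who runs the script.

```python
import errno
import socket


def try_bind_udp(host: str, port: int) -> int:
    """Try to bind a UDP socket; return 0 on success, else the errno."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
        return 0
    except OSError as exc:
        return exc.errno or -1
    finally:
        sock.close()


def describe(err: int) -> str:
    """Render a bind result in a form comparable to a log message."""
    if err == 0:
        return "ok"
    if err == errno.EACCES:
        return "EACCES (permission denied)"
    return errno.errorcode.get(err, str(err))


# Kernel-assigned port: no special privilege needed.
print(describe(try_bind_udp("127.0.0.1", 0)))
# Privileged port: outcome depends on platform and user privileges.
print(describe(try_bind_udp("127.0.0.1", 80)))
```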

regards, tom lane

#6 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#5)
Re: Intermittent stats test failures on buildfarm

Tom Lane wrote:

> "Rocco Altier" <RoccoA@Routescape.com> writes:
>
>> Also, kookaburra (AIX) has a problem with the stats test as well.
>
> kookaburra's problem is entirely different, not intermittent in the
> least. The error diff shows that stats collection is off, and its
> postmaster log says
>
> LOG: could not bind socket for statistics collector: Permission denied
> LOG: disabling statistics collector for lack of working socket
>
> I have no idea what's causing that --- the only reason I know of for
> EACCES from bind() is trying to bind to a privileged port number, and
> one hopes we're not doing that.

The other thing that's rather odd is that it's failing at the
installcheck stage, which means it passed this same test moments
before at the check stage. Installcheck failures in buildfarm should
always be regarded suspiciously.

cheers

andrew