Intermittent "make check" failures on hyena

Started by Tom Lane, over 19 years ago · 10 messages
#1Tom Lane
tgl@sss.pgh.pa.us

I'm noticing that buildfarm member hyena sometimes fails the parallel
regression tests, for instance
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=hyena&dt=2006-07-19%2009:20:00

The symptom is always that one of the tests fails entirely because
psql couldn't connect:

psql: could not connect to server: Connection refused
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.55678"?

It's a different test failing in each occurrence. Sometimes there are
ensuing failures in subsequent tests that expect the side-effects
of the one that failed, but there's clearly a common cause here.

AFAIK it is not possible for Postgres itself to cause a "connection
refused" failure --- that's a kernel-level errno. So what's going on
here? The only idea that comes to mind is that this version of Solaris
has some very low limit on SOMAXCONN, and when the timing is just so
it's bouncing connection requests because several of them arrive faster
than the postmaster can fork off children. Googling suggests that there
are versions of Solaris with SOMAXCONN as low as 5 :-( ... but other
pages say that the default is higher, so this theory might be wrong.

What is SOMAXCONN set to on that box, anyway? If it's tiny, I suggest
you increase SOMAXCONN to something saner, or if you can't, run "make
check" with MAX_CONNECTIONS=5 added to the make command. Does the
buildfarm script have provisions for site-local settings of this
parameter?

regards, tom lane

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: Intermittent "make check" failures on hyena

Tom Lane wrote:

I'm noticing that buildfarm member hyena sometimes fails the parallel
regression tests, for instance
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=hyena&dt=2006-07-19%2009:20:00

The symptom is always that one of the tests fails entirely because
psql couldn't connect:

psql: could not connect to server: Connection refused
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.55678"?

It's a different test failing in each occurrence. Sometimes there are
ensuing failures in subsequent tests that expect the side-effects
of the one that failed, but there's clearly a common cause here.

AFAIK it is not possible for Postgres itself to cause a "connection
refused" failure --- that's a kernel-level errno. So what's going on
here? The only idea that comes to mind is that this version of Solaris
has some very low limit on SOMAXCONN, and when the timing is just so
it's bouncing connection requests because several of them arrive faster
than the postmaster can fork off children. Googling suggests that there
are versions of Solaris with SOMAXCONN as low as 5 :-( ... but other
pages say that the default is higher, so this theory might be wrong.

What is SOMAXCONN set to on that box, anyway? If it's tiny, I suggest
you increase SOMAXCONN to something saner, or if you can't, run "make
check" with MAX_CONNECTIONS=5 added to the make command. Does the
buildfarm script have provisions for site-local settings of this
parameter?

Yes it sure does.

This is the box that Sun donated, btw.

I get: ndd /dev/tcp tcp_conn_req_max_q => 128

Is that the Solaris equivalent of SOMAXCONN? That's low, maybe, but not
impossibly low.

I don't have root on the box, though. For now I have set MAX_CONNECTIONS
to 8, to provide a modest limit on parallelism. I will see if I can
coordinate with Robert and Josh to increase the OS limits.
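For whoever ends up with root: the tunable can be inspected and raised with ndd (standard Solaris syntax; the value 1024 below is just an arbitrary example, not a recommendation):

```shell
# Query the current TCP accept-queue limit (what I ran above):
ndd /dev/tcp tcp_conn_req_max_q

# Raise it; requires root. 1024 is an arbitrary example value.
ndd -set /dev/tcp tcp_conn_req_max_q 1024
```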

Thanks for the diagnosis.

cheers

andrew

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#2)
Re: Intermittent "make check" failures on hyena

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

AFAIK it is not possible for Postgres itself to cause a "connection
refused" failure --- that's a kernel-level errno. So what's going on
here? The only idea that comes to mind is that this version of Solaris
has some very low limit on SOMAXCONN, and when the timing is just so
it's bouncing connection requests because several of them arrive faster
than the postmaster can fork off children. Googling suggests that there
are versions of Solaris with SOMAXCONN as low as 5 :-( ... but other
pages say that the default is higher, so this theory might be wrong.

This is the box that Sun donated, btw.
I get: ndd /dev/tcp tcp_conn_req_max_q => 128
Is that the Solaris equivalent of SOMAXCONN? That's low, maybe, but not
impossibly low.

Yeah, I found that variable name in googling. If it's 128 then there's
no way that it's causing the problem --- you'd have to assume a value in
the single digits to explain the observed failures.

I see one occurrence in the 8.1 branch on hyena, but the failure
probability seems to have jumped way up in HEAD since we put in the
C-coded pg_regress. This lends weight to the idea that it's a
timing-related issue, because pg_regress.c is presumably much faster
at forking off a parallel gang of psqls than the shell script was;
and it's hard to see what else about the pg_regress change could be
affecting the psqls' ability to connect once forked.

We probably need to get some Solaris experts involved in diagnosing
what's happening. Judging by the buildfarm results you should be able
to replicate it fairly easily by doing "make installcheck-parallel"
repeatedly.

regards, tom lane

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#3)
Re: Intermittent "make check" failures on hyena

Tom Lane wrote:

I see one occurrence in the 8.1 branch on hyena, but the failure
probability seems to have jumped way up in HEAD since we put in the
C-coded pg_regress. This lends weight to the idea that it's a
timing-related issue, because pg_regress.c is presumably much faster
at forking off a parallel gang of psqls than the shell script was;
and it's hard to see what else about the pg_regress change could be
affecting the psqls' ability to connect once forked.

We probably need to get some Solaris experts involved in diagnosing
what's happening. Judging by the buildfarm results you should be able
to replicate it fairly easily by doing "make installcheck-parallel"
repeatedly.

I will refer this to those experts - my Solaris-fu is a tad rusty these
days.

cheers

andrew

#5Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Andrew Dunstan (#4)
Re: Intermittent "make check" failures on hyena

Andrew Dunstan wrote:

Tom Lane wrote:

I see one occurrence in the 8.1 branch on hyena, but the failure
probability seems to have jumped way up in HEAD since we put in the
C-coded pg_regress. This lends weight to the idea that it's a
timing-related issue, because pg_regress.c is presumably much faster
at forking off a parallel gang of psqls than the shell script was;
and it's hard to see what else about the pg_regress change could be
affecting the psqls' ability to connect once forked.

We probably need to get some Solaris experts involved in diagnosing
what's happening. Judging by the buildfarm results you should be able
to replicate it fairly easily by doing "make installcheck-parallel"
repeatedly.

I will refer this to those experts - my Solaris-fu is a tad rusty these
days.

As Tom mentioned, the problem is the size of the TCP connection queue (the tcp_conn_req_max_q parameter). The default is 128 on Solaris 10. A second limit is twice the number of backends; see src/backend/libpq/pqcomm.c:

    /*
     * Select appropriate accept-queue length limit.  PG_SOMAXCONN is only
     * intended to provide a clamp on the request on platforms where an
     * overly large request provokes a kernel error (are there any?).
     */
    maxconn = MaxBackends * 2;
    if (maxconn > PG_SOMAXCONN)
        maxconn = PG_SOMAXCONN;

    err = listen(fd, maxconn);

So what happened? I think the following scenario occurred: the postmaster listens in only one process, while many clients run in genuine parallel. The T2000 server has 32 hardware threads (8 cores with 4 threads each). These clients generate more TCP/IP connection requests at one time than the postmaster is able to accept.

Zdenek

#6Josh Berkus
josh@agliodbs.com
In reply to: Zdenek Kotala (#5)
Re: Intermittent "make check" failures on hyena

Zdenek,

So what happened? I think the following scenario occurred: the postmaster listens in only one process, while many clients run in genuine parallel. The T2000 server has 32 hardware threads (8 cores with 4 threads each). These clients generate more TCP/IP connection requests at one time than the postmaster is able to accept.

I don't quite follow this ... are you saying that the regression tests
generate more than 128 connections? And that Solaris ships by default
allowing only 128 connections?

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#7Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Josh Berkus (#6)
Re: Intermittent "make check" failures on hyena

Josh Berkus wrote:

Zdenek,

So what happened? I think the following scenario occurred: the postmaster listens in only one process, while many clients run in genuine parallel. The T2000 server has 32 hardware threads (8 cores with 4 threads each). These clients generate more TCP/IP connection requests at one time than the postmaster is able to accept.

I don't quite follow this ... are you saying that the regression tests
generate more than 128 connections? And that Solaris ships by default
allowing only 128 connections?

My idea was completely wrong. The problem is not related to the TCP/IP
stack; it involves only AF_UNIX sockets. I have tried to analyze the
problem with DTrace, but I cannot reproduce it on my machine. Josh,
could I have access to this server? (For DTrace I need root access as well.)

Zdenek

#8Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Josh Berkus (#6)
Re: Intermittent "make check" failures on hyena

Josh Berkus wrote:

Zdenek,

So what happened? I think the following scenario occurred: the postmaster listens in only one process, while many clients run in genuine parallel. The T2000 server has 32 hardware threads (8 cores with 4 threads each). These clients generate more TCP/IP connection requests at one time than the postmaster is able to accept.

I don't quite follow this ... are you saying that the regression tests
generate more than 128 connections? And that Solaris ships by default
allowing only 128 connections?

The default of 128 applies to the queue of TCP connections that are
established but not yet accepted. This is called the backlog (see man
listen), and it can be changed with the ndd command. The problem,
however, is that Unix domain sockets have a different backlog limit:
32 requests waiting for accept, and that value is not tunable. Kernel
patch 118855-15 increases it to 128 (see bugid 4352289), and newer
versions of Solaris/OpenSolaris use a much larger value, 4096.

Please install the latest Solaris 10 patches on this machine and check
whether it helps.

Zdenek

#9Andrew Dunstan
andrew@dunslane.net
In reply to: Zdenek Kotala (#8)
Re: Intermittent "make check" failures on hyena

Zdenek Kotala wrote:

Josh Berkus wrote:

Zdenek,

So what happened? I think the following scenario occurred: the postmaster listens in only one process, while many clients run in genuine parallel. The T2000 server has 32 hardware threads (8 cores with 4 threads each). These clients generate more TCP/IP connection requests at one time than the postmaster is able to accept.

I don't quite follow this ... are you saying that the regression tests
generate more than 128 connections? And that Solaris ships by
default allowing only 128 connections?

The default of 128 applies to the queue of TCP connections that are
established but not yet accepted. This is called the backlog (see man
listen), and it can be changed with the ndd command. The problem,
however, is that Unix domain sockets have a different backlog limit:
32 requests waiting for accept, and that value is not tunable. Kernel
patch 118855-15 increases it to 128 (see bugid 4352289), and newer
versions of Solaris/OpenSolaris use a much larger value, 4096.

Please install the latest Solaris 10 patches on this machine and check
whether it helps.

Even at 32 this hardly seems a likely cause of the problem; I think
the maximum parallelism of our tests is around 20. Anyway, let's get
the patch installed - I have a test regime set up that will reproduce
the error moderately reliably within a dozen or so tries.

cheers

andrew

#10Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Andrew Dunstan (#9)
Re: Intermittent "make check" failures on hyena

Andrew Dunstan wrote:

Even at 32 this hardly seems a likely cause of the problem; I think
the maximum parallelism of our tests is around 20. Anyway, let's get
the patch installed - I have a test regime set up that will reproduce
the error moderately reliably within a dozen or so tries.

I have analyzed this problem in depth, and it is a problem in the local
transport layer module - the tl driver in the Solaris kernel. It is a
kind of race condition, and it appears when there are parallel requests
to close and connect. I have filed a bug. Retrying the connection
should work around it.

Zdenek